Abstract

Multinomial-response models are available that correspond implicitly to tests in which a 
total score is computed as the sum of polytomous item scores. For these models, joint and 
conditional estimation may be considered in much the same way as for the Rasch model for 
right-scored tests. As in the Rasch model, joint estimation is only attractive if both the 
number of items and the number of examinees are large, while conditional estimation can 
be employed for a large number of examinees whether or not the number of items is large. 
In neither case is computation difficult given currently available computers. Large-sample 
results favor use of conditional estimation, although some use of joint estimation can be 
contemplated if the number of items is large. 
Introduction

Probability models based on exponential families are readily constructed for tests in 
which the total score is the sum of polytomous item scores. In these models, which are 
described in Section 1, the total scores for the examinees are part of the sufficient statistics
for the model. These models can be employed to assess what information, if any, can 
be obtained concerning examinees that is not revealed by the total score. In addition, 
these models typically have parameters that correspond to such common concepts in item 
response theory as examinee ability and item difficulty. The models themselves have existed 
for some time (Bock, 1972; Andrich, 1978; Masters, 1982; Andersen, 1983); however, the 
appropriateness of joint and conditional estimation has not been extensively studied for 
these models in the common case in which both the number of examinees and the number 
of items are large. In this report, two aspects of joint and conditional estimation are 
considered for this case. Consistency and asymptotic normality of parameter estimates 
are explored for these methods, and computation of estimates is considered. Marginal
estimation for this class of models is not considered in this report, although the topic does
merit study. This report confines attention to techniques that do not require assumptions
about the underlying ability distribution.

Joint and conditional estimation proceed in much the same way as for the Rasch 
model for binary responses (Rasch, 1960; Haberman, 1977, 2004). As in the Rasch 
model for binary responses, straightforward application of maximum likelihood presents a 
number of complications if no restrictions are imposed on the ability distribution, so that 
joint maximum likelihood and conditional maximum likelihood will receive considerable 
attention. 

Section 2 examines joint maximum-likelihood estimation (JMLE). Results rely heavily
on previously derived results for the binary Rasch model (Andersen, 1972; Fischer, 1981; 
Haberman, 1977, 2004). As expected, JMLE does not lead to fully satisfactory approximate 
confidence intervals for item parameters, and the normal approximation for the distribution 
of ability estimates is not fully satisfactory. Nonetheless, joint estimation does have possible 
use in construction of starting values for conditional estimation. 

Section 3 examines conditional maximum-likelihood estimation (CMLE) for the
models under study. Techniques are based on those for the binary Rasch model
(Andersen, 1972, 1973a, 1973b; Fischer, 1981; Haberman, 2004). Basic properties of 
conditional maximum-likelihood estimates are readily examined. Computation with the 
Newton-Raphson algorithm is only moderately more complicated than for the binary Rasch 
model provided that convolutions are used and starting values exploit joint estimation. 
Normal approximations for estimates of item parameters are established that apply whether 
or not the number of items increases. 

To illustrate results, data from Form A of the TOEFL® field trial are used. To compare 
estimates, the reading and listening sections are examined as a single test. Although the 
preponderance of items have simple right scores, one reading item has integer scores from 0 
to 4, one has integer scores 0 to 2, and one has integer scores 0 to 3. Two listening items 
have integer scores 0 to 2. In all, 71 items are scored on 2,720 examinees. Use of the 
single test provides a better opportunity for joint estimation than is afforded by separate
reading and listening tests. In addition, it is easily verified that the listening and
reading test results are very highly correlated, so that it is not obvious that much error is 
introduced by combining the scales. The actual loss of information from this step is to be 
considered in a separate paper. 

Section 4 summarizes the implications of the research for psychometric practice and
discusses some further areas of possible development. 

1. Models for Polytomous Scoring 

In the basic model for polytomous scoring, the number of possible scores per item may 
vary, but it is required that all scores be rational numbers. This requirement is necessary for 
conditional estimation. The model examined is the nominal model (Bock, 1972; Andersen, 
1983). The rating scale model (Andrich, 1978) and partial credit model (Masters, 1982) are 
considered in relationship to the basic model. 

In the model under study, the score of examinee $i$, $1 \le i \le n$, is a sum of scores assigned
to each of $q$ items. For each item $j$ and examinee $i$, let the response be denoted by the
integer $Y_{ij}$. For some integer $r_j \ge 2$, let $0 \le Y_{ij} \le r_j - 1$, and for each $Y_{ij}$, let the possible
scores on item $j$ be $u_j(k)$, $0 \le k \le r_j - 1$. Let the score of examinee $i$ on item $j$ be
$u_{ij} = u_j(Y_{ij})$, so that the total examinee score is

$$S_i = \sum_{j=1}^{q} u_{ij}.$$

In the TOEFL case with reading and listening scores combined, $q = 71$ and $r_j = 2$ except
for Items 11, 25, 38, 42, and 58. For Item 11, $r_{11} = 5$; for Item 38, $r_{38} = 4$; and for Items
$j$ equal to 25, 42, and 58, $r_j = 3$. In the TOEFL example, $u_j(k)$ is always $k$. In the math
and verbal sections of the SAT® I examination, which was recently replaced with the SAT
Reasoning Test™, a somewhat more complex system of scoring is used. If item $j$ is a
multiple-choice item with $d_j > 1$ alternatives, then a score of 1 is used for a correct answer,
a score of $-1/(d_j - 1)$ is used for an incorrect answer, and a score of 0 is used for an omitted
response. In grid-in responses, a score of 1 is used for a correct response. No response or an
incorrect response receives a score of 0. It is easily seen that the SAT scoring method is a
special case of the scoring method considered in this paper. Unlike the TOEFL example,
the scores for items are not necessarily integers and are not necessarily nonnegative.
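The two scoring rules can be sketched in a few lines of Python. This is a minimal illustration, not code from the report: the function names and the coding of the multiple-choice categories (0 = omitted, 1 = incorrect, 2 = correct) are assumptions made for the example, and exact rational arithmetic is used because the formula scores need not be integers.

```python
from fractions import Fraction

def total_score(responses, score_tables):
    """Return S_i, the sum over items j of u_j(Y_ij), for one examinee."""
    return sum(table[y] for table, y in zip(score_tables, responses))

def formula_score_table(d):
    """Scores u_j(k) for a multiple-choice item with d alternatives.

    Hypothetical category coding: 0 = omitted, 1 = incorrect, 2 = correct.
    """
    return [Fraction(0), Fraction(-1, d - 1), Fraction(1)]

# TOEFL-like items: the score of category k is k itself.
toefl_like = [[0, 1], [0, 1, 2], [0, 1, 2, 3, 4]]
toefl_total = total_score([1, 2, 3], toefl_like)

# SAT-like items: two five-alternative multiple-choice items and one grid-in
# item (category 0 = omitted or incorrect, category 1 = correct).
sat_like = [formula_score_table(5), formula_score_table(5),
            [Fraction(0), Fraction(1)]]
sat_total = total_score([2, 1, 1], sat_like)  # correct, incorrect, correct
```

In the second case the total is $1 - 1/4 + 1 = 7/4$, which illustrates why the scores are required only to be rational.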

It is assumed that the vectors $\mathbf{Y}_i$ with coordinates $Y_{ij}$, $1 \le j \le q$, are mutually
independent and identically distributed. It is also assumed that, for each item $j$, the
possible scores $u_j(k)$ are not all equal, so that the item response can change the total score.
If $\bar{u}_j$ is the arithmetic mean

$$\bar{u}_j = \frac{1}{r_j} \sum_{k=0}^{r_j - 1} u_j(k),$$

then

$$U_j = \sum_{k=0}^{r_j - 1} [u_j(k) - \bar{u}_j]^2 > 0.$$

Because conditional estimation is often considered, the added assumption is made that each
score $u_j(k)$ is equal to $u_{jn}(k)/u_{jd}(k)$ for an integer $u_{jn}(k)$ and a positive integer $u_{jd}(k)$, so
that $u_j(k)$ is a rational number. This requirement is needed to permit useful inferences
conditional on the examinee scores $S_i$.

In the nominal model, to each examinee $i$ corresponds an unknown ability parameter $\theta_i$,
and the $\theta_i$ are independent random variables with common unknown distribution function
$D$. Given $\theta_i$, the $Y_{ij}$, $1 \le j \le q$, and the $\theta_h$, $h \ne i$, are mutually independent. To each
item $j$ correspond unknown item parameters $\beta_{jk}$, $0 \le k \le r_j - 1$. To construct vectors to
use with these parameters, let $R_0 = 0$ and $R_j = R_{j-1} + r_j$ for $1 \le j \le q$. Let $\boldsymbol{\beta}$ be the
vector of dimension $R_q$ with coordinate $\zeta(j, k) = R_{j-1} + k + 1$ equal to $\beta_{jk}$, $1 \le j \le q$ and
$0 \le k \le r_j - 1$. Under the nominal model, the conditional probability that $Y_{ij} = k$ given $\theta_i$
is

$$p_{ijk} = p_{jk}(\boldsymbol{\beta}, \theta_i), \tag{1}$$

where

$$\epsilon_j(\boldsymbol{\beta}, \theta_i) = \left\{ \sum_{k=0}^{r_j - 1} \exp[\theta_i u_j(k) - \beta_{jk}] \right\}^{-1} \tag{2}$$

and

$$p_{jk}(\boldsymbol{\beta}, \theta_i) = \epsilon_j(\boldsymbol{\beta}, \theta_i) \exp[\theta_i u_j(k) - \beta_{jk}] \tag{3}$$

(Bock, 1972; Andersen, 1983). To permit identification of parameters, the convention is
adopted that $\boldsymbol{\beta}$ is in the set $\Lambda$ of $R_q$-dimensional vectors $\mathbf{x}$ with coordinate $\zeta(j, k)$ equal to
$x_{jk}$ such that $\sum_{k=0}^{r_j - 1} x_{jk} = 0$ for each item $j$ and $\sum_{k=0}^{r_1 - 1} [u_1(k) - \bar{u}_1] x_{1k} = 0$.
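The category probabilities of the nominal model can be evaluated directly from the definition. The following Python sketch is illustrative only: the function name and the numerical parameter values are assumptions, and the maximum logit is subtracted before exponentiation purely for numerical stability.

```python
import math

def category_probs(theta, scores, betas):
    """Return [p_jk(beta, theta)] for one item j.

    p_jk is proportional to exp(theta * u_j(k) - beta_jk), normalized so
    that the probabilities sum to 1 over k = 0, ..., r_j - 1.
    """
    logits = [theta * u - b for u, b in zip(scores, betas)]
    top = max(logits)  # subtracting the maximum avoids overflow
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-category item with u_j(k) = k and beta_j summing to 0.
probs = category_probs(0.5, [0, 1, 2], [0.3, -0.1, -0.2])

# Binary item with beta_j1 = -beta_j0: the binary Rasch model.
rasch = category_probs(0.5, [0, 1], [-0.4, 0.4])
```

For the binary item, the probability of a correct response reduces to the logistic function of $\theta - 2\beta_{j1}$, the familiar Rasch form.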

Conditional on the $\theta_i$, sufficient statistics for the observed $Y_{ij}$, $1 \le i \le n$, $1 \le j \le q$,
are the examinee scores $S_i$, $1 \le i \le n$, and the numbers $f_{+jk}$ of examinees with $Y_{ij} = k$,
$0 \le k \le r_j - 1$ and $1 \le j \le q$. The nominal model is the model implicitly defined by the
requirement of sufficiency of the $S_i$ and the $f_{+jk}$ given the $\theta_i$ (Gilula & Haberman, 2000).

Special cases of the nominal model for polytomous item scores can be found in the
literature. The Rasch model for binary data arises if $r_j = 2$ and $u_j(k) = k$ for each $j$
(Rasch, 1960). In this case, the identifiability restrictions are equivalent to the requirements
that $\beta_{j1} = -\beta_{j0}$ for $j \ge 2$ and $\beta_{10} = \beta_{11} = 0$. In the partial credit model, $u_j(k) = k$ and $r_j$
is a constant $r$, so that $\sum_{k=0}^{r-1} \beta_{jk} = 0$ and $\sum_{k=0}^{r-1} [k - (r - 1)/2] \beta_{1k} = 0$ (Masters, 1982). In a
version of the rating scale model, $r_j$ is a constant $r$, $u_j(k)$ is independent of $j$, and

$$\beta_{jk} = \mu_k - \nu_j [u_1(k) - \bar{u}_1]$$

for unknown $\mu_k$ and $\nu_j$ such that $\sum_{k=0}^{r-1} \mu_k = 0$ and $\sum_{k=0}^{r-1} [u_1(k) - \bar{u}_1] \mu_k = \nu_1 U_1$. The $\nu_j$ are
item difficulties. These conditions are satisfied if $\mu_r = 0$, $u_1(r) = 0$, and

$$\nu_1 = U_1^{-1} \sum_{k=1}^{r} [u_1(k) - \bar{u}_1] \mu_k$$

(Andrich, 1978).
For all versions of the nominal model, the probability that $\mathbf{Y}_i$ has a specific value $\mathbf{c}$ is
readily calculated. Consequently, a log likelihood function can be obtained. To calculate
the desired probability, let $\Gamma$ be the set of $q$-dimensional vectors $\mathbf{c}$ with integer coordinates
$c_j$, $1 \le j \le q$, such that $0 \le c_j \le r_j - 1$. Then for $\mathbf{c}$ in $\Gamma$, the probability $p_J(\mathbf{c})$ that $\mathbf{Y}_i = \mathbf{c}$ is

$$p_J(\mathbf{c}) = E\left( \prod_{j=1}^{q} p_{jc_j}(\boldsymbol{\beta}, \theta_i) \right).$$

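For a more explicit expression, let

$$S(\mathbf{c}) = \sum_{j=1}^{q} u_j(c_j)$$

be the score $S_i = S(\mathbf{c})$ obtained if $\mathbf{Y}_i = \mathbf{c}$. Let $\mathcal{S}$ be the set of possible values of $S_i$, so that
$s$ is in $\mathcal{S}$ if, and only if, $s = S(\mathbf{c})$ for some $\mathbf{c}$ in $\Gamma$. Let $A$ be the number of elements of $\mathcal{S}$,
and let $s(a)$ be the $a$th smallest element of $\mathcal{S}$ for $a$ from 1 to $A$. For $s$ in $\mathcal{S}$, let $\Gamma(s)$ be
the set of $\mathbf{c}$ in $\Gamma$ such that $S(\mathbf{c}) = s$. For $R_q$-dimensional vectors $\mathbf{x}$ and $\mathbf{y}$ with respective
coordinates $\zeta(j, k)$ equal to $x_{jk}$ and $y_{jk}$, $1 \le j \le q$ and $0 \le k \le r_j - 1$, let

$$\mathbf{x}'\mathbf{y} = \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} x_{jk} y_{jk}.$$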


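For any $\mathbf{c}$ in $\Gamma$, let $Z_{jk}(\mathbf{c})$ be 1 for $c_j = k$ and 0 otherwise, and let $\mathbf{Z}(\mathbf{c})$ denote the
$R_q$-dimensional vector with coordinate $\zeta(j, k)$ equal to $Z_{jk}(\mathbf{c})$, $1 \le j \le q$ and $0 \le k \le r_j - 1$.
Then

$$\boldsymbol{\beta}'\mathbf{Z}(\mathbf{c}) = \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} Z_{jk}(\mathbf{c}) \beta_{jk} = \sum_{j=1}^{q} \beta_{jc_j}.$$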


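Let

$$M_s(\boldsymbol{\beta}) = \sum_{\mathbf{c} \in \Gamma(s)} \exp[-\boldsymbol{\beta}'\mathbf{Z}(\mathbf{c})] \tag{4}$$

for $s$ in $\mathcal{S}$, let

$$\Phi(\boldsymbol{\beta}, \theta) = \sum_{s \in \mathcal{S}} \exp(\theta s) M_s(\boldsymbol{\beta}),$$

and let

$$\tau_s = \log \int [\Phi(\boldsymbol{\beta}, \theta)]^{-1} \exp(\theta s) \, dD(\theta). \tag{5}$$

Let $\boldsymbol{\tau}$ be the $A$-dimensional vector with coordinate $a$ equal to $\tau_{s(a)}$ for $a$ from 1 to $A$. For
any $\mathbf{c}$ in $\Gamma(s)$ and any $s$ in $\mathcal{S}$,

$$p_J(\mathbf{c}) = \exp[-\boldsymbol{\beta}'\mathbf{Z}(\mathbf{c}) + \tau_s]. \tag{6}$$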
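The quantities $M_s(\boldsymbol{\beta})$ need not be computed by enumerating every response pattern; they can be accumulated one item at a time by convolution. The Python sketch below is a minimal illustration under assumed parameter values (the function names and numbers are hypothetical). It also evaluates the product over items of $\sum_k \exp[\theta u_j(k) - \beta_{jk}]$, which equals $\sum_s \exp(\theta s) M_s(\boldsymbol{\beta})$.

```python
import math

def score_weights(score_tables, betas):
    """Return {s: M_s(beta)}, built up item by item by convolution.

    M_s(beta) is the sum of exp(-beta'Z(c)) over patterns c with S(c) = s.
    """
    weights = {0: 1.0}
    for scores, item_betas in zip(score_tables, betas):
        updated = {}
        for s, w in weights.items():
            for u, b in zip(scores, item_betas):
                updated[s + u] = updated.get(s + u, 0.0) + w * math.exp(-b)
        weights = updated
    return weights

def phi(score_tables, betas, theta):
    """Product over items of the sum over k of exp(theta*u_j(k) - beta_jk)."""
    value = 1.0
    for scores, item_betas in zip(score_tables, betas):
        value *= sum(math.exp(theta * u - b)
                     for u, b in zip(scores, item_betas))
    return value

# Hypothetical three-item test; each row of betas sums to 0.
tables = [[0, 1], [0, 1, 2], [0, 1]]
betas = [[0.2, -0.2], [0.5, -0.1, -0.4], [-0.3, 0.3]]
m = score_weights(tables, betas)
theta = 0.7
phi_from_m = sum(math.exp(theta * s) * w for s, w in m.items())
```

The work grows with the number of items times the number of distinct scores, rather than with the number of response patterns, which is what makes the approach practical for tests of realistic length.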

Let $\mathbf{p}_J$ be the array of $p_J(\mathbf{c})$ for $\mathbf{c}$ in $\Gamma$. Let $\mathbb{S}$ consist of all $\mathbf{p}_J$ such that (6) holds for $\mathbf{c}$ in
$\Gamma(s)$ and $s$ in $\mathcal{S}$, where $\tau_s$ satisfies (5) for $s$ in $\mathcal{S}$ for some $\boldsymbol{\beta}$ in $\Lambda$ and some distribution
function $D$. To obtain the log likelihood function $\ell(\mathbf{p}_J)$, let $Z_{ijk}$ be 1 if $Y_{ij} = k$ and let
$Z_{ijk} = 0$ if $Y_{ij} \ne k$, and let

$$Z_{+jk} = \sum_{i=1}^{n} Z_{ijk}$$

be the number of examinees $i$ who provide response $k$ to item $j$. Let $\mathbf{Z}_+$ be the
$R_q$-dimensional vector with coordinate $\zeta(j, k)$ equal to $Z_{+jk}$, $1 \le j \le q$ and $0 \le k \le r_j - 1$.
Let $N_S(s)$ be the number of examinees $i$ with total score $S_i = s$, let $\mathbf{N}_S$ be the vector with
coordinate $a$ equal to $N_S(s(a))$, $a$ from 1 to $A$, and let

$$\mathbf{N}_S'\boldsymbol{\tau} = \sum_{s \in \mathcal{S}} N_S(s) \tau_s.$$

Then

$$\ell(\mathbf{p}_J) = \sum_{i=1}^{n} \log p_J(\mathbf{Y}_i) = -\boldsymbol{\beta}'\mathbf{Z}_+ + \mathbf{N}_S'\boldsymbol{\tau}.$$

Thus $\mathbf{Z}_+$ and $\mathbf{N}_S$ are jointly sufficient for $\boldsymbol{\beta}$ and $D$.

Use of maximum likelihood with the nominal model is far from straightforward due to
the integrals involved in the definition of $\tau_s$ and due to the lack of identifiability of the
distribution function $D$. The problem is similar to difficulties encountered with the Rasch
model (Cressie & Holland, 1983; Haberman, 2004). The probability $p_S(s)$ that $S_i = s$ is

$$p_S(s) = M_s(\boldsymbol{\beta}) \exp(\tau_s),$$

so that

$$\sum_{s \in \mathcal{S}} M_s(\boldsymbol{\beta}) \exp(\tau_s) = \sum_{s \in \mathcal{S}} p_S(s) = 1.$$
Because each $u_j(k)$ is rational, a largest positive rational number $B$ exists such that
any member $s$ of $\mathcal{S}$ is $s(1) + hB$ for a nonnegative integer $h \le [s(A) - s(1)]/B$. In the
TOEFL example under study, $B$ is 1. If $s = s(1) + hB$ for an integer $h$ and $s$ is in $\mathcal{S}$, then
$\exp(\tau_s - \tau_{s(1)})$ is the $h$th moment of a positive random variable $X$ such that the probability
that $\log X \le y$, $y$ real, is

$$\frac{\int_{-\infty}^{y} [\Phi(\boldsymbol{\beta}, \theta)]^{-1} \exp[s(1)\theta] \, dD(\theta)}{\int_{-\infty}^{\infty} [\Phi(\boldsymbol{\beta}, \theta)]^{-1} \exp[s(1)\theta] \, dD(\theta)}.$$

Because only a finite number of moments are specified by the ratios $\exp(\tau_s - \tau_{s(1)})$, it
follows that more than one distribution function $D$ corresponds to the same $\boldsymbol{\beta}$ and $\tau_s$, $s$ in
$\mathcal{S}$. On the other hand, if a positive random variable $X$ exists such that $\exp(\tau_s - \tau_{s(1)})$ is the
$h$th moment of $X$ whenever $s = s(1) + hB$ is in $\mathcal{S}$ and if $G$ is the distribution function of
$\log X$, then (5) holds if

$$D(\theta) = \frac{\int_{-\infty}^{\theta} \Phi(\boldsymbol{\beta}, x) \exp[-s(1)x] \, dG(x)}{\int_{-\infty}^{\infty} \Phi(\boldsymbol{\beta}, x) \exp[-s(1)x] \, dG(x)}.$$
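The span $B$ of the score lattice introduced at the start of this argument is easily computed when the possible total scores are available as exact rationals. The Python sketch below is illustrative only; the function name is an assumption, and the input is taken to be a collection of attainable scores that are not all equal.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def lattice_span(scores):
    """Largest positive rational B with every score equal to min + h*B.

    Here h is a nonnegative integer. Assumes the scores are not all equal.
    """
    base = min(scores)
    diffs = [Fraction(s) - base for s in scores if s != base]
    # Least common multiple of the denominators of the differences.
    common = reduce(lambda a, b: a * b // gcd(a, b),
                    (d.denominator for d in diffs), 1)
    # Greatest common divisor of the numerators over that common denominator.
    numer = reduce(gcd, (int(d * common) for d in diffs), 0)
    return Fraction(numer, common)

integer_span = lattice_span([0, 1, 2, 3])
formula_span = lattice_span([Fraction(-1, 4), 0, 1])
```

With integer item scores $u_j(k) = k$, as in the TOEFL example, the span is 1; with formula scores from five-alternative multiple-choice items, it is 1/4.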

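The nominal model implies the log-linear model in which, for some $\boldsymbol{\beta}$ in $\Lambda$ and real $\tau_s$,
$s$ in $\mathcal{S}$,

$$\log p_J(\mathbf{c}) = -\boldsymbol{\beta}'\mathbf{Z}(\mathbf{c}) + \tau_s \tag{7}$$

for $\mathbf{c}$ in $\Gamma(s)$ and $s$ in $\mathcal{S}$ and

$$\sum_{s \in \mathcal{S}} M_s(\boldsymbol{\beta}) \exp(\tau_s) = 1. \tag{8}$$

Let $\mathbb{S}_+$ consist of all $\mathbf{p}_J$ such that (7) holds for some $\boldsymbol{\beta}$ in $\Lambda$ and some $\tau_s$, $s$ in $\mathcal{S}$, such that
(8) holds. Then $\mathbf{p}_J$ satisfies the log-linear extension of the nominal model if, and only if, $\mathbf{p}_J$
is in $\mathbb{S}_+$. On the other hand, the log-linear model does not imply the nominal model, for
the nominal model can only hold if $\tau_s$ satisfies the convexity condition that

$$\tau_s \le a\tau_t + (1 - a)\tau_u$$

whenever $s = at + (1 - a)u$, $s$, $t$, and $u$ are in $\mathcal{S}$, and $0 \le a \le 1$ (Feller, 1966, p. 153).
Thus the set $\mathbb{S}$ of arrays $\mathbf{p}_J$ that satisfy the nominal model is a proper subset of $\mathbb{S}_+$. For the
log-linear extension of the nominal model, the log-likelihood $\ell(\mathbf{p}_J)$ has the same form as in
the nominal model, and the sufficient statistics remain $\mathbf{Z}_+$ and $\mathbf{N}_S$.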
2. Joint Maximum-Likelihood Estimation


In JMLE, the ability parameters $\theta_i$ are regarded as fixed parameters to be estimated.
The estimates of the $\theta_i$ are then used to estimate the distribution function $D$. Use of joint
maximum likelihood has a long history of controversy in many areas of statistics (Kiefer
& Wolfowitz, 1956). In many circumstances, joint maximum likelihood is relatively easily
implemented; however, consistency of estimates is a major concern, especially if the number
of items is fixed and the number of subjects increases. Consistency issues can be resolved if
both the number of items and the number of subjects increase, a result that is known in
the special case of the binary Rasch model (Haberman, 1977, 2004). In this section, it is
shown that joint estimation is rather unsatisfactory in terms of consistency if the number
of items is not large, but joint estimation can lead to consistent and asymptotically normal
parameter estimates if both the number of items and the number of examinees are large.

To simplify large-sample results, a number of boundedness assumptions and convergence
assumptions are made. To begin, it is assumed that the $\theta_i$ are bounded, so that $D(x)$ is 0
for $x$ sufficiently small, and $D(x) = 1$ for $x$ sufficiently large. It is also assumed that the
$\beta_{jk}$, $u_{jn}(k)$, $u_{jd}(k)$, and $r_j$ are uniformly bounded if $q$ goes to $\infty$, so that $B$ has the same
value for all $q$ sufficiently large. Let $r_{\max}$ be the largest value of $r_j$ for any $j \ge 1$. It is
assumed that, for each integer $r \le r_{\max}$, the fraction of items $j$ with $r_j = r$ approaches a
constant $f_r$ as $q$ increases and the empirical distribution of the vectors $\boldsymbol{\beta}_j$ of item parameters
$\beta_{jk}$, $r_j = r$, converges weakly to the distribution of the $r$-dimensional random vector $\boldsymbol{\beta}^*_r$.
The assumptions made imply that constants $s^*_-$ and $s^*_+$ exist such that $s(1)/q$ converges to
$s^*_-$ and $s(A)/q$ converges to $s^*_+$.

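To define joint estimation, let $\mathbf{p}$ denote the array of $p_{ijk}$, $1 \le i \le n$, $1 \le j \le q$,
$0 \le k \le r_j - 1$. The joint log likelihood function

$$\ell_J(\mathbf{p}) = \sum_{i=1}^{n} \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} Z_{ijk} \log p_{ijk}$$

is maximized under the model constraints. In the expression for $\ell_J(\mathbf{p})$, note that
$\sum_{k=0}^{r_j - 1} Z_{ijk} \log p_{ijk}$ is $\log p_{ijh}$ if $Y_{ij} = h$. The resulting maximum $\ell_{JM}$ under the constraints
from (1) is achieved if, and only if, $\beta_{jk} = \hat{\beta}_{jk}$ and $\theta_i = \hat{\theta}_i$ for $\hat{\beta}_{jk}$ and $\hat{\theta}_i$ such that $\hat{\boldsymbol{\beta}}$ is the
$R_q$-dimensional vector with coordinate $\zeta(j, k)$ equal to $\hat{\beta}_{jk}$,

$$\hat{p}_{ijk} = p_{jk}(\hat{\boldsymbol{\beta}}, \hat{\theta}_i), \tag{9}$$

$\hat{\boldsymbol{\beta}}$ is in $\Lambda$,

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) \hat{p}_{ijk} = S_i, \tag{10}$$

and

$$Z_{+jk} = \hat{p}_{+jk} = \sum_{i=1}^{n} \hat{p}_{ijk} \tag{11}$$

(Haberman, 1977). If the $\hat{\beta}_{jk}$ and $\hat{\theta}_i$ exist, then they are uniquely defined. The $\hat{\beta}_{jk}$
are the JMLEs of the $\beta_{jk}$, and the $\hat{\theta}_i$ are the JMLEs of the $\theta_i$. The vector $\hat{\boldsymbol{\beta}}$ is the
joint maximum-likelihood estimate of $\boldsymbol{\beta}$.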


2.1 Computations and Collapsed Tables 

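Computation of JMLEs is greatly simplified by use of a collapsed table. Let $\mathcal{S}_+$ be
the set of $s$ in $\mathcal{S}$ such that $N_S(s) > 0$. Consider the array with entries $f_{sjk}$ for $s$ in $\mathcal{S}_+$,
$0 \le k \le r_j - 1$, and $1 \le j \le q$, such that $f_{sjk}$ is the number of examinees $i$, $1 \le i \le n$, such
that $S_i = s$ and $Y_{ij} = k$. For real $x$ and $y$, let $\delta_x(y)$ be 1 for $x = y$ and 0 otherwise. Observe
that

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) f_{sjk} = \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} \sum_{i=1}^{n} \delta_s(S_i) u_j(k) Z_{ijk} = \sum_{i=1}^{n} \delta_s(S_i) \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) Z_{ijk} = s N_S(s)$$

for $s$ in $\mathcal{S}_+$, and the sum

$$f_{+jk} = \sum_{s \in \mathcal{S}_+} f_{sjk} = \sum_{s \in \mathcal{S}_+} \sum_{i=1}^{n} \delta_s(S_i) Z_{ijk} = \sum_{i=1}^{n} Z_{ijk} \sum_{s \in \mathcal{S}_+} \delta_s(S_i) = Z_{+jk}.$$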
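Construction of the collapsed table requires a single pass through the responses. The Python sketch below is a minimal illustration with a hypothetical four-examinee data set; the function name and the dictionary layout of the counts are assumptions made for the example.

```python
from collections import Counter

def collapse(responses, score_tables):
    """Return (f, n_by_score) for a list of response vectors.

    f[(s, j, k)] is the number of examinees with total score s and response
    k to item j; n_by_score[s] is the number of examinees with score s.
    """
    f = Counter()
    n_by_score = Counter()
    for y in responses:
        s = sum(score_tables[j][k] for j, k in enumerate(y))
        n_by_score[s] += 1
        for j, k in enumerate(y):
            f[(s, j, k)] += 1
    return f, n_by_score

# Hypothetical data: three items with 2, 2, and 3 categories, u_j(k) = k.
tables = [[0, 1], [0, 1], [0, 1, 2]]
responses = [(1, 0, 2), (0, 1, 1), (1, 1, 0), (0, 1, 1)]
f, n_by_score = collapse(responses, tables)

# Weighted sums of the collapsed counts, one per observed score s.
weighted = {s: sum(tables[j][k] * count
                   for (t, j, k), count in f.items() if t == s)
            for s in n_by_score}
```

In this example the weighted sum for each observed score $s$ equals $s N_S(s)$, and summing the counts over $s$ recovers the marginal totals $Z_{+jk}$, in agreement with the two identities above.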
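Consider maximization of the collapsed log likelihood

$$\ell_{JC}(\mathbf{p}_C) = \sum_{s \in \mathcal{S}_+} \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} f_{sjk} \log p_{sjkC}$$

for $\mathbf{p}_C$ the array of $p_{sjkC}$, $s$ in $\mathcal{S}_+$, $0 \le k \le r_j - 1$, $1 \le j \le q$, with the constraints that

$$p_{sjkC} = p_{jk}(\boldsymbol{\beta}, \theta_{sC})$$

and $\boldsymbol{\beta}$ is in $\Lambda$. Let $\ell_{JCM}$ be the supremum of $\ell_{JC}$. Then $\ell_{JCM} = \ell_{JC}(\mathbf{p}_C)$ if, and only if, $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}_C$
and $\theta_{sC} = \hat{\theta}_{sC}$, where $\hat{\boldsymbol{\beta}}_C$ is in $\Lambda$,

$$\hat{p}_{sjkC} = p_{jk}(\hat{\boldsymbol{\beta}}_C, \hat{\theta}_{sC}) \tag{12}$$

for $s$ in $\mathcal{S}_+$, $0 \le k \le r_j - 1$, and $1 \le j \le q$,

$$\sum_{s \in \mathcal{S}_+} N_S(s) \hat{p}_{sjkC} = f_{+jk} = Z_{+jk} \tag{13}$$

for $0 \le k \le r_j - 1$ and $1 \le j \le q$, and

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) \hat{p}_{sjkC} = s \tag{14}$$

for $s$ in $\mathcal{S}_+$ (Haberman, 1977, 2004). The vector $\hat{\boldsymbol{\beta}}_C$ is uniquely defined if it exists. The $\hat{\theta}_{sC}$
are uniquely defined for $s$ in $\mathcal{S}_+$ if they exist.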

Because $\hat{p}_{sjkC}$ is positive and less than 1 for $s$ in $\mathcal{S}_+$, $0 \le k \le r_j - 1$, and $1 \le j \le q$, (14)
cannot hold if $s(1)$ or $s(A)$ is in $\mathcal{S}_+$, that is, if some examinee $i$ exists such that either

$$u_j(Y_{ij}) = u_{j+} = \max_{0 \le k \le r_j - 1} u_j(k)$$

for all items $j$ or

$$u_j(Y_{ij}) = u_{j-} = \min_{0 \le k \le r_j - 1} u_j(k)$$

for all items $j$. In the case of the TOEFL examination, $A = 80$, $s(1) = 0$, and $s(A) = 79$.
One examinee achieved a total score of 79, so nonexistence is an issue.

The relationship of $\hat{\theta}_{sC}$ and $\hat{\boldsymbol{\beta}}_C$ to the corresponding joint maximum-likelihood estimates
is straightforward. If $\hat{\theta}_{sC}$ and $\hat{\boldsymbol{\beta}}_C$ exist, then $\hat{\theta}_i = \hat{\theta}_{sC}$ for $S_i = s$ and $\hat{\beta}_{jk} = \hat{\beta}_{jkC}$. If the
$\hat{\beta}_{jk}$ and $\hat{\theta}_i$ exist, then $\hat{\beta}_{jkC} = \hat{\beta}_{jk}$ and $\hat{\theta}_{sC} = \hat{\theta}_i$ for $s = S_i$. Thus joint maximum-likelihood
estimates are readily found by maximization of $\ell_{JC}$.

From a computational standpoint, the collapsed table has major impact, for one can
compute joint maximum-likelihood estimates by acting as if a multinomial response model
holds with independent arrays $f_{sjk}$, $0 \le k \le r_j - 1$, with sample size $N_S(s)$ and with
probabilities $p_{sjk}$, $0 \le k \le r_j - 1$, for $1 \le j \le q$ and $s$ in $\mathcal{S}_+$, where

$$p_{sjk} = p_{jk}(\boldsymbol{\beta}, \theta_{sC}).$$

Instead of an $n$ by $q$ array of responses $Y_{ij}$, it suffices to consider the array of counts $f_{sjk}$.
If $r_+ = \sum_{j=1}^{q} r_j$, then the array of $f_{sjk}$ has no more than $A r_+$ elements. For instance, in
the TOEFL example, the array has no more than $80 \times 150 = 12{,}000$ entries, but there
are $2{,}720 \times 71 = 193{,}120$ responses $Y_{ij}$ to consider. The array of $f_{sjk}$ also assists in the
study of existence of joint maximum-likelihood estimates and in the study of large-sample
properties of JMLE. Given existing software for multinomial response models, computation
of $\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_C$ and $\hat{\theta}_{sC}$, $s$ in $\mathcal{S}_+$, is straightforward.
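The table sizes quoted for the TOEFL example follow directly from the item structure described in the introduction: 66 binary items, one item with five score categories, one with four, and three with three. The short Python check below reproduces the arithmetic; the variable names are chosen only for this illustration.

```python
# Number of response categories r_j for each of the 71 TOEFL items.
categories = [2] * 66 + [5, 4, 3, 3, 3]

q = len(categories)                          # number of items
r_plus = sum(categories)                     # r_+ = sum of the r_j
max_score = sum(r - 1 for r in categories)   # s(A), because u_j(k) = k
num_scores = max_score + 1                   # A: every integer 0..s(A) occurs
collapsed_cells = num_scores * r_plus        # upper bound on the f_sjk array
raw_responses = 2720 * q                     # size of the n by q array
```

The collapsed array is thus at most about one sixteenth the size of the raw response array.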

2.2 Existence of Joint Maximum-Likelihood Estimates 

Existence of joint maximum-likelihood estimates is a substantial problem in practice. 
To study the issue, standard results from the theory of log-linear models are used as in the 
following theorem (Haberman, 1974, chap. 2): 

Theorem 1 Joint maximum-likelihood estimates exist if, and only if, a table of positive $g_{sjk}$,
$0 \le k \le r_j - 1$, $1 \le j \le q$, $s$ in $\mathcal{S}_+$, exists such that $g_{+jk} = f_{+jk}$ for $0 \le k \le r_j - 1$ and
$1 \le j \le q$, $\sum_{k=0}^{r_j - 1} g_{sjk} = N_S(s)$ for $1 \le j \le q$ and $s$ in $\mathcal{S}_+$, and

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) g_{sjk} = s N_S(s)$$

for $s$ in $\mathcal{S}_+$.

It is clearly true that joint maximum-likelihood estimates exist if $f_{sjk}$ is positive for all
$s$ in $\mathcal{S}_+$, $0 \le k \le r_j - 1$, and $1 \le j \le q$, for one may just take $g_{sjk} = f_{sjk}$. It is clearly true
that joint maximum-likelihood estimates do not exist if $f_{+jk}$ is 0 for some $j$ and $k$ or if $s(1)$
or $s(A)$ is in $\mathcal{S}_+$.

These results suffice to indicate that joint maximum-likelihood estimates do not exist
for the TOEFL example, for $N_S(s(A)) = 1 > 0$. Thus a more general approach to joint
estimation is required for the TOEFL data.
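The two simple nonexistence conditions just noted are easy to screen for before any estimation is attempted. The Python sketch below is illustrative only: the function name and data are hypothetical, and an empty result does not establish existence, which requires the full condition of Theorem 1.

```python
def jmle_nonexistence_reasons(responses, score_tables):
    """List simple reasons why joint ML estimates cannot exist.

    Checks only the two conditions stated in the text: an examinee with the
    minimum or maximum possible total score, or a response category that no
    examinee uses. An empty list does NOT guarantee existence.
    """
    reasons = []
    s_min = sum(min(table) for table in score_tables)
    s_max = sum(max(table) for table in score_tables)
    totals = [sum(score_tables[j][k] for j, k in enumerate(y))
              for y in responses]
    if s_min in totals:
        reasons.append("minimum score observed")
    if s_max in totals:
        reasons.append("maximum score observed")
    for j, table in enumerate(score_tables):
        used = {y[j] for y in responses}
        for k in range(len(table)):
            if k not in used:
                reasons.append(f"category {k} of item {j} never used")
    return reasons

tables = [[0, 1], [0, 1]]
flagged = jmle_nonexistence_reasons([(0, 0), (1, 1), (1, 0)], tables)
unflagged = jmle_nonexistence_reasons([(1, 0), (0, 1)], tables)
```

In the TOEFL data, the single examinee with a total score of 79 would trigger the maximum-score condition.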

2.3 Extended Joint Maximum-Likelihood Estimates 

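Without any conditions, extended joint maximum-likelihood estimates $\hat{p}_{ijk}$ of $p_{ijk}$ may
be defined such that $0 \le \hat{p}_{ijk} \le 1$, $\hat{p}_{ij+} = 1$, $\hat{p}_{+jk} = Z_{+jk}$,

$$\sum_{i=1}^{n} \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} Z_{ijk} \log \hat{p}_{ijk} = \ell_{JM},$$

and real $\theta_{i\nu}$, $1 \le i \le n$, and $\boldsymbol{\beta}_\nu$ in $\Lambda$ exist for $\nu \ge 0$ such that

$$p_{jk}(\boldsymbol{\beta}_\nu, \theta_{i\nu})$$

approaches $\hat{p}_{ijk}$ as $\nu$ approaches $\infty$ (Haberman, 1974, pp. 402-404). The definition of
extended joint maximum-likelihood estimates $\hat{p}_{ijk}$ is consistent with the previous definition
of $\hat{p}_{ijk}$ when joint maximum-likelihood estimates exist. The $\hat{p}_{ijk}$ are uniquely defined. In
addition, if real $\theta_{i0\nu}$ and $\boldsymbol{\beta}_{0\nu}$ in $\Lambda$ exist for $1 \le i \le n$ and $\nu \ge 0$ such that

$$p_{jk}(\boldsymbol{\beta}_{0\nu}, \theta_{i0\nu})$$

approaches $p_{ijk0}$ as $\nu$ approaches $\infty$,

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) p_{ijk0} = S_i,$$

and $p_{+jk0} = Z_{+jk}$, then $p_{ijk0} = \hat{p}_{ijk}$. In terms of the collapsed table, $\hat{p}_{sjkC}$ are defined so
that (13) and (14) hold, and $\hat{p}_{ijk} = \hat{p}_{sjkC}$ for $S_i = s$. In this case,

$$\sum_{s \in \mathcal{S}_+} \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} f_{sjk} \log \hat{p}_{sjkC} = \ell_{JCM}.$$

The estimates $\hat{p}_{sjkC}$ may be used to create estimates $\hat{\theta}_i$, $\hat{\theta}_{sC}$, and $\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_C$. These
estimates may be infinite in some cases. For instance, $\hat{\theta}_i = \hat{\theta}_{sC} = \infty$ if $S_i = s(A)$, and
$\hat{\theta}_i = \hat{\theta}_{sC} = -\infty$ if $S_i = s(1)$. If the $\hat{p}_{s1kC}$ are positive for all $k$ from 0 to $r_1 - 1$ and $S_i = s$, then

$$\hat{\theta}_i = U_1^{-1} \sum_{k=0}^{r_1 - 1} [u_1(k) - \bar{u}_1] \log \hat{p}_{s1kC}.$$

If, in addition, the $\hat{p}_{sjkC}$ are positive for $0 \le k \le r_j - 1$ for some $j > 1$, then coordinate
$\zeta(j, k)$ of $\hat{\boldsymbol{\beta}}$ is

$$\hat{\beta}_{jk} = -\log \hat{p}_{sjkC} + r_j^{-1} \sum_{k'=0}^{r_j - 1} \log \hat{p}_{sjk'C} + \hat{\theta}_{sC} [u_j(k) - \bar{u}_j].$$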

2.4 Consistency 

Even if (1) holds, if the number $q$ of items is constant, the $\beta_{jk}$ are constant, and $n$
approaches $\infty$, then the $\hat{\theta}_i$ are not consistent estimates of the $\theta_i$, and the $\hat{\beta}_{jk}$ are not
consistent estimates of the $\beta_{jk}$. This outcome is predictable given results for the Rasch
model for binary data (Andersen, 1973a, pp. 66-69). Indeed, the probability approaches
1 that ordinary maximum-likelihood estimates do not even exist, a result expected given
similar results for the binary Rasch model (Haberman, 1977).

A much more subtle problem arises if the number $q$ of items increases as the number
$n$ of examinees increases. Given previous results for the binary Rasch model (Haberman,
1977, 2004), it is reasonable to expect that consistency results would be available in
this case. As shown in this section, this expectation is indeed fulfilled. One finds that
$\max_{1 \le j \le q} \max_{0 \le k \le r_j - 1} |\hat{\beta}_{jk} - \beta_{jk}|$ converges in probability to 0, and, for any given examinee
$i$, $\hat{\theta}_i - \theta_i$ converges in probability to 0. It also follows that the empirical distribution
function $\hat{D}$ of the $\hat{\theta}_i$ converges weakly with a probability of 1 to the distribution function $D$
of $\theta_i$.

A fixed number of items. For a fixed number of items, the existence issue is quite
straightforward for ordinary joint maximum likelihood, although consistency requires a
more careful argument. Consider the following theorems.

Theorem 2 Let the number $q$ of items be fixed, and let the number $n$ of examinees approach
$\infty$. Then the probability that joint maximum-likelihood estimates exist approaches 0.

Proof. Let $P_s$, $s$ in $\mathcal{S}$, be the unconditional probability $P(S_i = s)$ that $S_i = s$. Then
each $P_s$ is positive, so that the probability is positive that examinee $i$ has either a minimal
formula score $S_i = s(1)$ or a maximal formula score $s(A)$. Joint maximum-likelihood
estimates only can exist if $s(1) < S_i < s(A)$ for each examinee $i$ from 1 to $n$. The
probability that $s(1) < S_i < s(A)$ for $1 \le i \le n$ is $[1 - (P_{s(1)} + P_{s(A)})]^n$. As $n$ approaches $\infty$,
this probability approaches 0.
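The rate at which this probability vanishes is readily illustrated numerically. In the Python sketch below, the function name and the probabilities of the extreme scores are hypothetical values chosen for illustration; they are not estimates from the TOEFL data.

```python
def prob_no_extreme_score(p_min, p_max, n):
    """Probability that none of n independent examinees has an extreme score.

    p_min and p_max are the probabilities of the minimum and maximum total
    scores for a single examinee.
    """
    return (1.0 - (p_min + p_max)) ** n

# Hypothetical probabilities of 0.001 for each extreme score.
small_sample = prob_no_extreme_score(0.001, 0.001, 100)
large_sample = prob_no_extreme_score(0.001, 0.001, 2720)
```

Even with extreme-score probabilities this small, the chance that no examinee has an extreme score is well under 1% for a sample of 2,720 examinees.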


Theorem 3 Under the conditions of Theorem 2, for any integer $i \ge 1$, $\hat{\theta}_i - \theta_i$ does not
converge in probability to 0.

Proof. If $S_i = s(1)$, then $\hat{\theta}_i = -\infty$. If $S_i = s(A)$, then $\hat{\theta}_i = \infty$. Because the probabilities
$P_{s(1)}$ and $P_{s(A)}$ defined in the proof of Theorem 2 are positive and constant and because
$\hat{\theta}_i = -\infty$ with probability at least $P_{s(1)}$ and $\hat{\theta}_i = \infty$ with probability at least $P_{s(A)}$, it follows
that $\hat{\theta}_i - \theta_i$ does not converge in probability to 0.

The inconsistency of $\hat{\boldsymbol{\beta}}$ is less obvious in the case of extended joint maximum-likelihood
estimates. Some insight is readily provided through an examination of the statistical
properties of the counts $f_{sjk}$. This examination can be used to show that $\hat{\boldsymbol{\beta}}$ converges with
a probability of 1 to a limit $\boldsymbol{\beta}_M$ that is not necessarily $\boldsymbol{\beta}$. Demonstration of this claim
requires a study of the expectation $E(f_{sjk})$ of $f_{sjk}$. If $m_{sjkC}$ is the conditional expectation
of $Z_{ijk}$ given $S_i = s$, then $E(f_{sjk})$ is $n P_s m_{sjkC}$. As in the Rasch model, $m_{sjkC}$ depends on
the array $\boldsymbol{\beta}$ of item parameters but not on the examinee ability $\theta_i$. Let

$$M_{sjk}(\boldsymbol{\beta}) = \sum_{\mathbf{c} \in \Gamma(s)} Z_{jk}(\mathbf{c}) \exp[-\boldsymbol{\beta}'\mathbf{Z}(\mathbf{c})]$$

be the negative of the partial derivative of $M_s(\boldsymbol{\beta})$ with respect to $\beta_{jk}$, where $M_s(\boldsymbol{\beta})$ is defined
as in (4). The conditional probability that $\mathbf{Y}_i = \mathbf{c}$ in $\Gamma(s)$ given $S_i = s$ is

$$p_{JC}(\mathbf{c}) = [M_s(\boldsymbol{\beta})]^{-1} \exp[-\boldsymbol{\beta}'\mathbf{Z}(\mathbf{c})],$$

so that $m_{sjkC}$ is

$$m_{sjk}(\boldsymbol{\beta}) = \frac{M_{sjk}(\boldsymbol{\beta})}{M_s(\boldsymbol{\beta})}.$$
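For a small test, the conditional probabilities $m_{sjk}(\boldsymbol{\beta})$ can be obtained by direct enumeration of the response patterns with a given total score. The Python sketch below is a brute-force illustration with hypothetical parameter values and function name; it assumes that the requested score $s$ is attainable and is not intended for tests of realistic length.

```python
import math
from itertools import product

def conditional_category_probs(score_tables, betas, s):
    """Return m[j][k] = M_sjk(beta) / M_s(beta) by enumerating patterns.

    Only patterns c with total score S(c) = s contribute. The ability
    parameter does not appear, since it cancels given the total score.
    """
    total = 0.0
    partial = [[0.0] * len(table) for table in score_tables]
    for c in product(*(range(len(table)) for table in score_tables)):
        if sum(score_tables[j][k] for j, k in enumerate(c)) != s:
            continue
        w = math.exp(-sum(betas[j][k] for j, k in enumerate(c)))
        total += w
        for j, k in enumerate(c):
            partial[j][k] += w
    return [[v / total for v in row] for row in partial]

# Hypothetical three-item test; each row of betas sums to 0.
tables = [[0, 1], [0, 1, 2], [0, 1]]
betas = [[0.2, -0.2], [0.5, -0.1, -0.4], [-0.3, 0.3]]
m_mid = conditional_category_probs(tables, betas, 2)
m_low = conditional_category_probs(tables, betas, 0)
```

For an interior score every category that is compatible with the score receives positive probability, whereas at the minimum score all of the probability falls on the lowest-scoring category of each item.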

Normally $m_{sjk}(\boldsymbol{\beta})$ is positive; however, $m_{sjk}(\boldsymbol{\beta})$ is 0 if $s = s(1)$ and $u_j(k) > u_{j-}$ or $s = s(A)$
and $u_j(k) < u_{j+}$. With these preliminary results, the following theorem is available.
Theorem 4 Under the conditions of Theorem 2, $\hat{\theta}_{sC}$ converges almost surely to $\theta_{sM}$, $s$ in
$\mathcal{S}$, and $\hat{\boldsymbol{\beta}}$ converges almost surely to $\boldsymbol{\beta}_M$, where $\theta_{s(1)M} = -\infty$, $\theta_{s(A)M} = \infty$, real $\theta_{sM}$, $s$ in $\mathcal{S}$,
$s(1) < s < s(A)$, $\boldsymbol{\beta}_M$ in $\Lambda$, and real $p_{sjkM}$, $s$ in $\mathcal{S}$, $1 \le j \le q$, and $0 \le k \le r_j - 1$, are uniquely
determined by the following conditions:

$$p_{sjkM} = p_{jk}(\boldsymbol{\beta}_M, \theta_{sM}) \tag{15}$$

for $s$ in $\mathcal{S}$ such that $s(1) < s < s(A)$, $p_{sjkM} = 0$ for $u_j(k) > u_{j-}$ and $s = s(1)$,

$$p_{sjkM} = \epsilon_{j1}(\boldsymbol{\beta}_M) \exp(-\beta_{jkM}) \tag{16}$$

for $s = s(1)$, $u_j(k) = u_{j-}$, $\beta_{jkM}$ coordinate $\zeta(j, k)$ of $\boldsymbol{\beta}_M$, and $[\epsilon_{j1}(\boldsymbol{\beta}_M)]^{-1}$ the sum of
$\exp(-\beta_{jkM})$ for $k$ from 0 to $r_j - 1$ for which $u_j(k) = u_{j-}$, $p_{sjkM} = 0$ for $u_j(k) < u_{j+}$ and
$s = s(A)$,

$$p_{sjkM} = \epsilon_{j2}(\boldsymbol{\beta}_M) \exp(-\beta_{jkM}) \tag{17}$$

for $s = s(A)$, $u_j(k) = u_{j+}$, and $[\epsilon_{j2}(\boldsymbol{\beta}_M)]^{-1}$ the sum of $\exp(-\beta_{jkM})$ for $k$ from 0 to $r_j - 1$
for which $u_j(k) = u_{j+}$,

$$\sum_{s \in \mathcal{S}} P_s p_{sjkM} = \sum_{s \in \mathcal{S}} P_s m_{sjkC} \tag{18}$$

for $1 \le j \le q$ and $0 \le k \le r_j - 1$, and

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) p_{sjkM} = \sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) m_{sjkC} = s \tag{19}$$

for $s$ in $\mathcal{S}$.

Proof. The strong law of large numbers implies that $n^{-1} f_{sjk}$ converges almost surely to
$P_s m_{sjkC}$. Existence and uniqueness of $\theta_{sM}$, $\boldsymbol{\beta}_M$, and $p_{sjkM}$ follow from standard results for
log-linear models (Haberman, 1974, chaps. 2, 9). Results on almost sure convergence follow
from general results on concave likelihood functions (Haberman, 1989).

To interpret the limit parameters $\theta_{sM}$, $\boldsymbol{\beta}_M$, and $p_{sjkM}$, logarithmic penalty functions
may be employed (Gilula & Haberman, 1994, 1995). Let

$$H_j(\mathbf{x}, \mathbf{y}) = -\sum_{k=0}^{r_j - 1} x_k \log(y_k)$$

for $r_j$-dimensional vectors $\mathbf{x}$ and $\mathbf{y}$ with respective nonnegative coordinates $x_k$ and $y_k$,
$0 \le k \le r_j - 1$, such that

$$\sum_{k=0}^{r_j - 1} x_k = \sum_{k=0}^{r_j - 1} y_k = 1.$$

In the definition of $H_j(\mathbf{x}, \mathbf{y})$, $0 \log 0 = 0$. Consider probability prediction of the responses
$Y_{ij}$ from the sums $S_i$ under the incorrect model that, conditional on $S_i = s$, $s$ in $\mathcal{S}$, the $Y_{ij}$,
$1 \le j \le q$, are independently distributed with probability

$$\pi_{sjk} = p_{jk}(\boldsymbol{\beta}_0, \theta_{s0})$$

that $Y_{ij} = k$ for unknown real parameters $\theta_{s0}$, $s$ in $\mathcal{S}$, and $\boldsymbol{\beta}_0$ in $\Lambda$. Let $\mathbf{m}_{sjC}$ be the
$r_j$-dimensional vector with coordinates $m_{sjkC}$, and let $\boldsymbol{\pi}_{sj}$ be the $r_j$-dimensional vector with
coordinates $\pi_{sjk}$. The expected logarithmic penalty per item is

$$q^{-1} \sum_{s \in \mathcal{S}} P_s \sum_{j=1}^{q} H_j(\mathbf{m}_{sjC}, \boldsymbol{\pi}_{sj}).$$

Let $\mathbf{p}_{sjM}$ be the $r_j$-dimensional vector with coordinates $p_{sjkM}$ for $0 \le k \le r_j - 1$. Then the
minimum expected penalty per item is

$$H_J = q^{-1} \sum_{s \in \mathcal{S}} P_s \sum_{j=1}^{q} H_j(\mathbf{m}_{sjC}, \mathbf{p}_{sjM}).$$

The expected penalty per item approaches $H_J$ if $\theta_{s0}$ approaches $\theta_{sM}$, $s \in \mathcal{S}$, and $\boldsymbol{\beta}_0$
approaches $\boldsymbol{\beta}_M$. Theorem 4 implies that the estimated expected log penalty function per
item

$$\hat{H}_J = -\frac{\ell_{JCM}}{nq}$$

converges almost surely to $H_J$.

Theorem 4 implies that inconsistency of $\hat{\boldsymbol{\beta}}$ is observed when $\boldsymbol{\beta}$ and $\boldsymbol{\beta}_M$ differ. This
situation is typically but not necessarily the case, as is evident from previous work on the
binary Rasch model (Andersen, 1973a).

The expected logarithmic penalty per item $H_J$ is at least as large as the conditional
entropy measure per item

$$H_M = q^{-1} \sum_{s \in \mathcal{S}} P_s \sum_{j=1}^{q} H_j(\mathbf{m}_{sjC}, \mathbf{m}_{sjC})$$

that corresponds to the conditional entropy per item of $Y_{iA}$ given $S_i$ for a random variable
$A$ uniformly distributed on the integers 1 to $q$ and independent of the $Y_{ij}$, $1 \le j \le q$. One
has $H_J = H_M$ if, and only if, $p_{sjkM} = m_{sjkC}$. Let $\hat{\mathbf{m}}_{sjC}$ be the $r_j$-dimensional vector with
coordinates $\hat{m}_{sjkC}$, $0 \le k \le r_j - 1$, where $\hat{m}_{sjkC} = f_{sjk}/N_S(s)$ for $N_S(s) > 0$ and $\hat{m}_{sjkC} = f_{+jk}/n$
otherwise. The entropy per item $H_M$ has an estimate

$$\hat{H}_M = \frac{1}{nq} \sum_{s \in \mathcal{S}} N_S(s) \sum_{j=1}^{q} H_j(\hat{\mathbf{m}}_{sjC}, \hat{\mathbf{m}}_{sjC})$$

that converges almost surely to $H_M$.

For an $R_q$-dimensional vector $\mathbf{x}$, let the maximum norm $|\mathbf{x}|$ be the maximum absolute
value of the coordinates of $\mathbf{x}$. As in the binary Rasch model (Haberman, 2004), the
magnitude of the maximum norm $|\boldsymbol{\beta}_M - \boldsymbol{\beta}|$ is of order $q^{-1}$. For a formal statement and
proof of this claim, consider the following theorem in which the number of items is allowed
to increase.

Theorem 5 A real number $\tau > 0$ exists such that $|\boldsymbol{\beta}_M - \boldsymbol{\beta}| < \tau/q$ for all $q \ge 1$ and all
items $j$, $1 \le j \le q$, and values $k$, $0 \le k \le r_j - 1$.

Proof. To verify this claim, consider the difference between $m_{sjkC}$ and

$$p_{sjkC} = p_{jk}(\boldsymbol{\beta}, \theta_{sC}),$$

where $\theta_{sC}$ is uniquely defined by the condition that

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j - 1} u_j(k) p_{sjkC} = s$$

(Haberman, 1974, chap. 10). Let $\omega_{sj}$ be independent observations with probability $p_{sjkC}$
that $\omega_{sj} = k$. The conditional probability $m_{sjkC}$ that $Y_{ij} = k$ given that $S_i = s$ is then the
conditional probability that $\omega_{sj} = k$ given that

$$L_s = \sum_{j=1}^{q} u_j(\omega_{sj}) = s.$$

This latter probability is then

$$P(\omega_{sj} = k) \, P(L_s - u_j(\omega_{sj}) = s - u_j(k)) / P(L_s = s).$$
Let

$$\mu_{sjC} = \sum_{k=0}^{r_j - 1} u_j(k) p_{sjkC}$$

be the mean,

$$\sigma^2_{sjC} = \sum_{k=0}^{r_j - 1} [u_j(k) - \mu_{sjC}]^2 p_{sjkC}$$

be the variance, and

$$\mu_{3sjC} = \sum_{k=0}^{r_j - 1} [u_j(k) - \mu_{sjC}]^3 p_{sjkC}$$

be the third central moment of $u_j(\omega_{sj})$. If $s$, $q$, and $n$ are selected so that

$$\sigma^2_{s+C} = \sum_{j=1}^{q} \sigma^2_{sjC}$$

approaches $\infty$, then

$$(L_s - s)/\sigma_{s+C}$$

and

$$[L_s - u_j(\omega_{sj}) - s + \mu_{sjC}] / (\sigma^2_{s+C} - \sigma^2_{sjC})^{1/2}$$

converge in distribution to a standard normal random variable (Cramér, 1946, pp. 215-216).
A refinement of this result permits approximation of $m_{sjkC}$ (Haberman, 2004). To derive
the desired approximations requires some simple modifications of results on Edgeworth
expansions for lattice distributions (Esseen, 1945). Terms are used based on the normal
density function and on its first three derivatives. Let

$$\psi_s = -\frac{1}{\sigma^2_{s+C}} \sum_{j=1}^{q} \mu_{3sjC},$$

so that $-\psi_s/\sigma_{s+C}$ is the skewness coefficient of $L_s$. Let

$$\Delta_{sjk} = \frac{\sigma^2_{sjC} - [u_j(k) - \mu_{sjC}]^2 - [u_j(k) - \mu_{sjC}] \psi_s}{2\sigma^2_{s+C}}.$$

It then follows that

$$q^2 |m_{sjkC} - p_{sjkC}(1 + \Delta_{sjk})|, \quad 1 \le j \le q,$$
is uniformly bounded. This result indicates that $m_{sjkC} - p_{sjkC}$ is of order $q^{-1}$. Consider the
conditional entropy $H_B$ of $Y'_B$ given $S_1$ and $B$ for $B$ uniformly distributed on the integers 1
to $q$ and $Y'_j$ random variables for $1 \le j \le q$ such that $P(Y'_j = k \mid S_1 = s) = p_{sjkC}$. Then

$$H_B = -q^{-1} \sum_{s \in S} P_s \sum_{j=1}^{q} H_j(p_{sjC}, p_{sjC})$$

and $H_M$ differ by a term of order $q^{-1}$.

To show that $|\beta_M - \beta|$ is of order $q^{-1}$ requires use of fixed point theorems (Loomis
& Sternberg, 1968, pp. 228-234). Consider solution of (18) for $s$ in $S$ subject to the
constraints that $\beta$ is in $\Lambda$ and (15) and (19) hold. For an $R_q$-dimensional vector $x$ there is
a unique real value $g_s(x)$, $s$ in $S$, $s(1) < s < s(A)$, for which

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j-1} u_j(k) p_{jk}(x, g_s(x)) = s.$$


Let

$$w_j(x, \theta) = \sum_{k=0}^{r_j-1} [u_j(k)]^2 p_{jk}(x, \theta) - \left[\sum_{k=0}^{r_j-1} u_j(k) p_{jk}(x, \theta)\right]^2,$$

let

$$p_{sjk}(x) = p_{jk}(x, g_s(x)),$$

let

$$\mu_{sj}(x) = \sum_{k=0}^{r_j-1} u_j(k) p_{sjk}(x),$$

let

$$w_{sj}(x) = w_j(x, g_s(x)),$$

and let

$$w_{s+}(x) = \sum_{j=1}^{q} w_{sj}(x).$$

The function $g_s$ is infinitely differentiable, and the partial derivative of $g_s(x)$ with respect
to $x_{jk}$, the coordinate $\zeta(j,k)$ of $x$, is

$$g_{sjk}(x) = \frac{1}{w_{s+}(x)} [u_j(k) - \mu_{sj}(x)] p_{sjk}(x).$$
For $s = s(1)$ and $u_j(k) = u_{j-}$, let

$$p_{sjk}(x) = e_{j1}(x) \exp(-x_{jk}),$$

and for $s = s(1)$ and $u_j(k) > u_{j-}$, let

$$p_{sjk}(x) = 0.$$

For $s = s(A)$ and $u_j(k) = u_{j+}$, let

$$p_{sjk}(x) = e_{j2}(x) \exp(-x_{jk}),$$

while for $s = s(A)$ and $u_j(k) < u_{j+}$, let

$$p_{sjk}(x) = 0.$$


Let $F(x, y)$ be defined for $x$ and $y$, $x$ an $R_q$-dimensional vector with coordinate $\zeta(j,k)$
equal to $x_{jk}$, $1 \le j \le q$, $0 \le k \le r_j - 1$, and $y$ an $R_q A$-dimensional vector with coordinate
$R_q(a - 1) + \zeta(j,k)$ equal to $y_{s(a)jk}$, $1 \le a \le A$, $1 \le j \le q$, $0 \le k \le r_j - 1$, so that $F(x, y)$ is
the $R_q$-dimensional vector with coordinate $\zeta(j,k)$ equal to

$$F_{jk}(x, y) = \sum_{s \in S} [y_{sj+} p_{sjk}(x) - y_{sjk}]$$

for $1 \le j \le q$ and $0 \le k \le r_j - 1$. Here

$$y_{sj+} = \sum_{k=0}^{r_j-1} y_{sjk}.$$

Then

$$F(\beta, z) = 0$$

for $z_{sjk} = P_s p_{sjkC}$, and

$$F(\beta_M, y') = 0$$

for $y'$ with $y'_{sjk} = P_s p_{sjkM}$. The conclusions of the theorem follow from application of
fixed point theorems to $F$. Arguments are quite similar to those previously applied to the
binary Rasch model (Haberman, 2004). As a consequence, only the required derivatives are
described in the remainder of the proof.
The function $F(x, y)$ of $x$ and $y$ is infinitely differentiable. It is linear in the second
argument $y$. The partial derivative of $F_{jk}$ with respect to $x_{j'k'}$ is

$$F_{jkj'k'}(x, y) = \sum_{s \in S} F_{sjkj'k'}(x, y),$$

where $F_{sjkj'k'}(x, y)$ is defined in the following fashion. For $s$ in $S$ and $s(1) < s < s(A)$,

$$F_{sjkj'k'}(x, y) = -y_{sj+} p_{sjk}(x) \{\delta_j(j')[\delta_k(k') - p_{sjk'}(x)] - [u_j(k) - \mu_{sj}(x)][u_{j'}(k') - \mu_{sj'}(x)] p_{sj'k'}(x)/w_{s+}(x)\}.$$

For $s = s(1)$ or $s = s(A)$,

$$F_{sjkj'k'}(x, y) = -y_{sj+} p_{sjk}(x) \delta_j(j')[\delta_k(k') - p_{sjk'}(x)].$$


Given the definitions of $\theta_{sM}$ and $\theta_{sC}$ and the properties of $g_s$, it also follows that
$\theta_{sM} - \theta_{sC}$ and $p_{sjkM} - p_{sjkC}$ are of order $q^{-1}$ if $s/q$ converges to a constant greater than $s^*_-$
and less than $s^*_+$. More precise expressions for these differences can be obtained but are not
especially attractive.

A variety of entropy measures are closely linked. The difference $H_J - H_M$ is of order
$q^{-2}$, so that $H_J - H_B$ is of order $q^{-1}$. Let $p_{ij}$ be the $r_j$-dimensional vector with coordinates
$p_{ijk}$ for $1 \le k \le r_j$. With a similar argument based on the normal approximation for the
distribution of $S_1$ given $\theta_1$, it follows that $H_B - H_\theta$ is of order $q^{-1}$ if

$$H_\theta = -q^{-1} \sum_{j=1}^{q} E(H_j(p_{1j}, p_{1j}))$$

is the conditional entropy per item of $Y_1$ given $\theta_1$.

The assumption that the numerator $u_{jn}(k)$ and denominator $u_{jd}(k)$ are uniformly
bounded implies that an integer $u > 0$ exists such that $S$ has $A \le uq$ elements for each value of $q$.
The conditional entropy per item

$$H_{+\theta} = -q^{-1} \sum_{s \in S} E(P(S_1 = s \mid \theta_1) \log P(S_1 = s \mid \theta_1))$$

of $S_1$ given $\theta_1$ and the unconditional entropy per item

$$H_+ = -q^{-1} \sum_{s \in S} P_s \log P_s$$

of $S_1$ cannot exceed $q^{-1} \log(uq)$. It follows that the conditional entropy per item

$$H_C = -q^{-1} \sum_{s \in S} P_s \sum_{c \in \Gamma(s)} p_{JC}(c) \log p_{JC}(c) = H_\theta - H_{+\theta}$$

of $Y_1$ given $S_1$ and $\theta_1$ differs from $H_\theta$ by a term of order $q^{-1} \log q$. The conditional
distribution of $Y_1$ given $S_1$ and $\theta_1$ is assumed independent of $\theta_1$, so that $H_C$ is also the
conditional entropy per item of $Y_1$ given $S_1$. It follows that $H_C$ differs from $H_B$, $H_J$, and
$H_M$ by terms of order $q^{-1} \log q$. The unconditional entropy per item

$$H_U = -q^{-1} \sum_{s \in S} \sum_{c \in \Gamma(s)} p_J(c) \log p_J(c) = H_C + H_+$$

of $Y_1$ differs from $H_\theta$, $H_C$, $H_J$, $H_B$, and $H_M$ by terms of order $q^{-1} \log q$.

Consistency if the number of items increases. Given that the bias magnitude $|\beta_M - \beta|$ is
reduced as $q$ increases, there is the suggestion that the inconsistency of the joint maximum
likelihood estimators for the Rasch model can be removed if the asymptotic framework
is changed so that the sample size $n$ and the number of items $q$ both approach
infinity (Haberman, 1977, 2004). The previous argument with fixed-point theorems is easily
modified. The normal approximations for the sums

$$\sum_{s \in S} [f_{sjk} - N_S(s) m_{sjkC}]$$

and large-deviation arguments may be used to demonstrate that
$|\hat{\beta} - \beta_M|$ and $|\hat{\beta} - \beta|$ both converge in probability to 0.

Arguments from the binary Rasch model may be applied virtually without change to
study the distribution of $\theta_1$ (Haberman, 2004). Both $\theta_{sM} - \theta_{sC}$ and $p_{sjkM} - p_{sjkC}$ converge
in probability to 0 if $s/q$ converges to a constant greater than $s^*_-$ and less than $s^*_+$. In turn,
it follows that, for any specific individual $i$, $\hat{\theta}_i$ converges in probability to $\theta_i$. Thus for any
real $\delta > 0$, the fraction of examinees $i \le n$ with $|\hat{\theta}_i - \theta_i| > \delta$ converges in probability to 0.
To estimate the distribution function $D$ of the random variable $\theta_1$, let $\hat{D}$ be the empirical
distribution function of the $\hat{\theta}_i$, so that $\hat{D}(x)$ is the fraction of the $\hat{\theta}_i$ that do not exceed the
real number $x$. If $D$ is continuous at $x$, then $|\hat{D}(x) - D(x)|$ converges in probability to 0. If
$h$ is a continuous or piecewise-continuous bounded function on the extended real line and $h$
is continuous at $\theta_1$ with a probability of 1, then

$$\hat{E}(h(\theta)) = n^{-1} \sum_{i=1}^{n} h(\hat{\theta}_i)$$

converges in probability to $E(h(\theta_1))$. If the distribution function $D$ is continuous, as is the
case for $\theta_1$ a continuous random variable, then

$$|\hat{D} - D| = \sup_x |\hat{D}(x) - D(x)|$$

converges in probability to 0.

The difference $\hat{H}_J - H_U$ then converges in probability to 0, so that the various conditional
entropies under study can be estimated. The difference $\hat{H}_M - H_M$ can only be expected to
converge in probability to 0 if $q^2/n$ approaches 0.

To ensure that all $\hat{\theta}_i$ are finite requires the condition that $nP_{s(1)}$ and $nP_{s(A)}$ both
approach 0. This condition will certainly hold if $q^{-1} \log n$ approaches 0 (Haberman, 1977).
In this case,

$$\max_{1 \le i \le n} |\hat{\theta}_i - \theta_i|$$

and

$$\max_{1 \le i \le n} \max_{1 \le j \le q} \max_{0 \le k \le r_j - 1} |\hat{p}_{ijk} - p_{ijk}|$$

converge in probability to 0.

For the TOEFL example, the consistency results are fairly satisfactory. The sample
size of $n = 2{,}720$ is large enough so that the basic consistency results for $\hat{\beta}$ are not a
problem if the model is correct. Because $q = 71$ and $q^{-1} \log n$ is 0.111, there is reason for
concern about the results that involve consistency of the ability estimates, for 0.111 is
not that small a number. This concern is justified to the extent that an observation does
exist in the sample for which the ability estimate is $\infty$. A further constraint exists in that
$q^{-1} \log q = 0.060$ is not especially small, so that the unconditional entropy $H_U$ is not well
estimated. Results are presumably worse if the 71 items are divided into the 38 items from
the reading test and the 33 items from the listening test.
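
The two diagnostic ratios quoted for the TOEFL example can be checked directly. The short sketch below recomputes them from the sample size and test length given in the text; the variable names are illustrative and natural logarithms are assumed.

```python
import math

n = 2720  # examinees in the TOEFL example (from the text)
q = 71    # items in the TOEFL example (from the text)

# Ratio that must be small for uniformly consistent ability estimates.
log_n_over_q = math.log(n) / q
# Ratio that governs how well the unconditional entropy per item is estimated.
log_q_over_q = math.log(q) / q

print(f"q^-1 log n = {log_n_over_q:.3f}")
print(f"q^-1 log q = {log_q_over_q:.3f}")
```

Both values agree with the figures 0.111 and 0.060 reported above when rounded to three decimals.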
2.5 Normal Approximations

The bias issues already noted in the discussion of consistency have an unusual effect on
normal approximations. It is relatively easy to find a normal approximation for the joint
maximum-likelihood estimate $\hat{\beta}_{jk}$ of the item parameter $\beta_{jk}$, but this approximation is often
not satisfactory because the asymptotic mean is $\beta_{jkM}$, coordinate $\zeta(j,k)$ of $\beta_M$, rather
than $\beta_{jk}$. A normal approximation for $\hat{\theta}_i$ is available with relatively little difficulty for $q$
large, but there are problems in practice with the accuracy achieved. Results are rather
straightforward generalizations of those for the binary case (Haberman, 2004).

If $q$ is constant and $n$ becomes large, then a normal approximation is available for $\hat{\beta}_{jk}$
but not for $\hat{\theta}_i$. The normal approximation is derived by conventional arguments based on
the function $F$ developed in Section 2.4. Once again, fixed point theorems are employed as
in the binary Rasch model (Haberman, 2004). Let $Z^+_{ijk}$ be the adjusted random variable
with value $Z_{ijk} - p_{sjkM}$ for $S_i = s$. Let $V^+$ be the covariance matrix of the $R_q$-dimensional
vector $Z^+_i$ with coordinate $\zeta(j,k)$ equal to $Z^+_{ijk}$ for $1 \le j \le q$ and $0 \le k \le r_j - 1$. Let $V^+_{jkj'k'}$
be the covariance of $Z^+_{ijk}$ and $Z^+_{ij'k'}$. Let

$$\mu_{sjM} = \sum_{k=0}^{r_j-1} u_j(k) p_{sjkM}$$

and

$$\sigma^2_{sjM} = \sum_{k=0}^{r_j-1} [u_j(k) - \mu_{sjM}]^2 p_{sjkM}$$

be the variance of a random variable that is $u_j(k)$ with probability $p_{sjkM}$, let

$$\sigma^2_{+jM} = \sum_{s \in S} P_s \sigma^2_{sjM},$$

$$\sigma^2_{s+M} = \sum_{j=1}^{q} \sigma^2_{sjM},$$

let $S'$ be the set of $s$ in $S$ that are neither $s(1)$ nor $s(A)$, let

$$T_{1jkj'k'} = \sum_{s \in S} P_s \delta_j(j') p_{sjkM} [\delta_k(k') - p_{sj'k'M}],$$

let

$$T_{2jkj'k'} = \sum_{s \in S'} P_s p_{sjkM} p_{sj'k'M} [u_j(k) - \mu_{sjM}][u_{j'}(k') - \mu_{sj'M}] / \sigma^2_{s+M},$$

and let

$$W_{jkj'k'} = T_{1jkj'k'} - T_{2jkj'k'}.$$

Let $W$ be the $R_q$ by $R_q$ matrix with row $\zeta(j,k)$ and column $\zeta(j',k')$ equal to $W_{jkj'k'}$. Note
that

$$\sum_{k'=0}^{r_{j'}-1} W_{jkj'k'} = \sum_{k'=0}^{r_{j'}-1} V^+_{jkj'k'} = 0$$

and

$$\sum_{j'=1}^{q} \sum_{k'=0}^{r_{j'}-1} u_{j'}(k') W_{jkj'k'} = \sum_{j'=1}^{q} \sum_{k'=0}^{r_{j'}-1} u_{j'}(k') V^+_{jkj'k'} = 0$$

and $W$ and $V^+$ are symmetric and positive semi-definite. Let $W^+$ be the $R_q$ by $R_q$ matrix
with row $\zeta(j,k)$ and column $\zeta(j',k')$ equal to

$$W^+_{jkj'k'} = W_{jkj'k'} + \delta_j(j') + [u_j(k) - \bar{u}_j][u_{j'}(k') - \bar{u}_{j'}].$$

Then $n^{1/2}(\hat{\beta} - \beta_M)$ converges in distribution to a multivariate normal random vector with
mean 0 and covariance matrix $(W^+)^{-1} V^+ (W^+)^{-1}$. The notable problem is that the normal
approximation involves $\beta_M$ rather than $\beta$.

If the number $q$ of items increases, then normal approximations remain available, but a
few changes in results are needed due to the changing dimension of $\beta$. Let

$$v_{sjkj'k'}(\beta) = \frac{M_{sjkj'k'}(\beta)}{M_s(\beta)} - m_{sjk}(\beta) m_{sj'k'}(\beta),$$

where

$$M_{sjkj'k'}(\beta) = \sum_{c \in \Gamma(s)} \delta_k(c_j) \delta_{k'}(c_{j'}) \exp[-\beta' Z(c)].$$

Thus $v_{sjkj'k'}(\beta)$ is the conditional covariance of $Z_{1jk}$ and $Z_{1j'k'}$ given $S_1 = s$. Arguments
similar to those applied for $m_{sjkC}$ may be used to show that

a t+c[ v sjkjk(P ) - crljc ~ ( 2 p S j W ~ 1 )PsjkC&sjk], 1 < j < q, 0 < k < Tj — 1 7 

&s+C \Zsjkjk' (/3) T PsjkCPsjk’C^ 1 T ^sjk T ^sjfe')], 1 A j A Qi 0 ^ k k 5; Vj 1, 


and 


^s+C^Vsjkj'k' (/3) T PsjkCPsj'k'c\^ji}^) l^sjkc\ \^j' ) psj'k'c\ /& s +c\i 

1 < j < j' < <1, 0 < k < Tj — 1, 0 < k' < rji — 1, 
are uniformly bounded as $\sigma^2_{s+C}$ approaches $\infty$.

Let $Q$ be an integer constant greater than 1. For $q \ge Q$, let $\beta_Q$ be the $R_Q$-dimensional
vector with coordinate $\zeta(j,k)$ of $\beta_{jk}$ for $1 \le j \le Q$ and $0 \le k \le r_j - 1$, let $\beta_{QM}$ be the
$R_Q$-dimensional vector with coordinate $\zeta(j,k)$ of $\beta_{jkM}$ for $1 \le j \le Q$ and $0 \le k \le r_j - 1$,
and let $\hat{\beta}_Q$ be the $R_Q$-dimensional vector with coordinate $\zeta(j,k)$ equal to $\hat{\beta}_{jk}$ for $1 \le j \le Q$
and $0 \le k \le r_j - 1$. Let $T_Q$ be the $R_Q$ by $R_Q$ matrix with row $\zeta(j,k)$ and column $\zeta(j',k')$
equal to

$$T_{jkj'k'} = E(\delta_j(j') p_{ijk} [\delta_k(k') - p_{ij'k'}]).$$

Let $T^+_Q$ be the $R_Q$ by $R_Q$ matrix with row $\zeta(j,k)$ and column $\zeta(j',k')$ equal to

$$T^+_{jkj'k'} = \delta_j(j') + T_{jkj'k'}.$$

Let $K_Q$ be the $R_Q$ by $R_Q$ matrix with row $\zeta(j,k)$ and column $\zeta(j',k')$ equal to

Rjkj'k' ^j (j )Sk(k ) F | (j ) UjkUy'k/ ■ 

Arguments similar to those for the binary Rasch model can be used (Haberman, 2004). Use
of the maximum norm shows that $n^{1/2}(\hat{\beta}_Q - \beta_{QM})$ converges in distribution to a multivariate
normal random vector with zero mean and covariance matrix $K_Q (T^+_Q)^{-1} T_Q (T^+_Q)^{-1} K'_Q$,
where the prime symbol is used to denote a transpose.

In practice, the asymptotic normality result is somewhat unsatisfactory. Clearly $\hat{\beta}_{jk}$
is intended to estimate $\beta_{jk}$ rather than $\beta_{jkM}$. If $n/q^2$ approaches 0, then $n^{1/2}(\hat{\beta}_Q - \beta_Q)$
converges in distribution to a multivariate normal random vector with mean 0 and
covariance matrix $K_Q (T^+_Q)^{-1} T_Q (T^+_Q)^{-1} K'_Q$. Nonetheless, it is far from clear that the
asymptotic approximation is adequate for the example under study, for $n/q^2$ is 0.540, which is not
a small number. The problem is likely to be much more severe with tests of similar length
in which an administration may involve around 500,000 examinees. As a practical matter,
the results indicate that ordinary asymptotic confidence intervals for $\beta_{jk}$ cannot be derived
by use of the normal approximation for $\hat{\beta}_{jk}$.
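
As a small illustration of why the ratio of sample size to squared test length matters, the sketch below evaluates it for the example in the text and for a hypothetical administration of 500,000 examinees on a test of the same length. The function name is an assumption introduced for illustration.

```python
def n_over_q_squared(n, q):
    """Ratio n/q^2, which should be near 0 for the bias in joint estimates to be negligible."""
    return n / (q * q)

toefl_ratio = n_over_q_squared(2720, 71)       # example from the text
large_ratio = n_over_q_squared(500_000, 71)    # hypothetical large administration

print(f"n/q^2 for the example: {toefl_ratio:.3f}")
print(f"n/q^2 for 500,000 examinees: {large_ratio:.1f}")
```

The second ratio is roughly two orders of magnitude larger than the first, which is the sense in which the problem becomes more severe as the sample grows with test length held fixed.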

In the case of an individual $i$ for an increasing number $q$ of items, the normal
approximation for $\hat{\theta}_i$ is relatively straightforward. Arguments for the binary Rasch model
apply with only minor modifications (Haberman, 2004). One finds that $(\hat{\theta}_i - \theta_i)/\sigma(\theta_i)$
converges in distribution to a standard normal random variable if $\sigma(\theta_i)$ is the inverse of

$$\sigma_{i+} = \left(\sum_{j=1}^{q} \sigma^2_{ij}\right)^{1/2},$$

$$\sigma^2_{ij} = \sum_{k=1}^{r_j} p_{ijk} [u_j(k) - \mu_{ij}]^2,$$

and

$$\mu_{ij} = \sum_{k=1}^{r_j} u_j(k) p_{ijk}.$$

In addition, for $Q$ a finite integer, the $Q$-dimensional vector with coordinates $(\hat{\theta}_i - \theta_i)/\sigma(\theta_i)$
for $1 \le i \le Q$ converges in distribution to a multivariate normal random vector with mean
0 and covariance matrix $I$.

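Approximate confidence intervals are available. The probability that

$$\hat{\theta}_i - z\hat{\sigma}(\hat{\theta}_i) < \theta_i < \hat{\theta}_i + z\hat{\sigma}(\hat{\theta}_i)$$

approaches $1 - \alpha$ if

$$\hat{\sigma}(\hat{\theta}_i) = 1/\hat{\sigma}_{i+},$$

$$\hat{\sigma}_{i+} = \left(\sum_{j=1}^{q} \hat{\sigma}^2_{ij}\right)^{1/2},$$

$$\hat{\sigma}^2_{ij} = \sum_{k=1}^{r_j} \hat{p}_{ijk} [u_j(k) - \hat{\mu}_{ij}]^2,$$

and

$$\hat{\mu}_{ij} = \sum_{k=1}^{r_j} u_j(k) \hat{p}_{ijk}.$$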
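
The interval above can be sketched in a few lines. This is an illustration under stated assumptions rather than the report's implementation: category probabilities are assumed proportional to exp(-beta_jk + u_j(k) * theta), the standard error is taken as the inverse square root of the summed item-score variances, and the multiplier `z` is assumed to be the appropriate standard normal quantile (1.96 for a nominal 95 percent interval). All function and argument names are hypothetical.

```python
import math

def category_probs(beta_j, u_j, theta):
    """Category probabilities for one item, assuming p_jk is proportional to
    exp(-beta_jk + u_j(k) * theta)."""
    logits = [-b + u * theta for b, u in zip(beta_j, u_j)]
    top = max(logits)                      # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def ability_interval(theta_hat, betas, scores, z=1.96):
    """Return (lower, upper, standard_error) for one examinee."""
    info = 0.0
    for beta_j, u_j in zip(betas, scores):
        p = category_probs(beta_j, u_j, theta_hat)
        mu = sum(u * pk for u, pk in zip(u_j, p))              # item-score mean
        info += sum(pk * (u - mu) ** 2 for u, pk in zip(u_j, p))  # item-score variance
    se = 1.0 / math.sqrt(info)
    return theta_hat - z * se, theta_hat + z * se, se

# Usage: 40 binary items with equal parameters and an ability estimate of 0.
betas = [[0.0, 0.0] for _ in range(40)]
scores = [[0, 1] for _ in range(40)]
lower, upper, se = ability_interval(0.0, betas, scores)
```

In this toy case each item contributes variance 0.25, so the summed variance is 10 and the standard error is the inverse square root of 10.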

For the TOEFL example, the estimated $\hat{\theta}_i$ range from $-2.632$ to 4.326 for the cases
with finite estimates. One value of $\hat{\theta}_i$ is $\infty$. The observed estimates $\hat{\sigma}(\hat{\theta}_i)$ range from 0.248
for scores $S_i$ from 38 to 41 to 1.009 for $S_i = 78$. The value of $\hat{\sigma}(\hat{\theta}_i)$ is taken to be $\infty$
for $S_i = 79$. The lower quartile for the $\hat{\theta}_i$ is $-0.448$, and the upper quartile is 1.164. The
estimates of asymptotic standard deviations suggest some limitations in the quality of the
normal approximations.

Normal approximations for $\hat{H}_J$ and $\hat{H}_M$ are somewhat unsatisfactory in practice due to
the relatively large estimation biases involved.
3. Conditional Maximum Likelihood

Conditional maximum-likelihood estimation is applicable to the nominal model
(Andersen, 1983), and conditional maximum likelihood is closely related to marginal
maximum likelihood. As shown in this section, conditional maximum likelihood is quite
effective in large samples whether or not the number of items is large, and computation
of conditional maximum-likelihood estimates is relatively straightforward. In conditional
maximum likelihood, inference is conditional on the observed examinee sums $S_i$. For $c$
in $\Gamma(s)$ and for $s$ in $S$, the conditional probability $p_{JC}(c)$ that $Y_i = c$ given that $S_i = s$
satisfies

$$p_{JC}(c) = p_J(c)/P_s.$$

Under the nominal model,

$$P_s = M_s(\beta) \exp(\tau_s),$$

so that

$$p_{JC}(c) = \exp(-\beta' c)/M_s(\beta) \qquad (20)$$

does not depend on the distribution function $D$ of the ability $\theta_1$. The conditional log
likelihood function is then

$$\ell_C(p_{JC}) = \sum_{i=1}^{n} \log p_{JC}(Y_i)$$

for the array $p_{JC}$ of $p_{JC}(c)$ for $c$ in $\Gamma$. Thus

$$\ell_C(p_{JC}) = -\beta' Z_+ - \sum_{s \in S} N_S(s) \log M_s(\beta).$$

Because $\ell_C$ is determined by the $f_{sjk}$, inferences again may be based on the collapsed table.

As in the binary Rasch model (Haberman, 2004), the relationship of conditional and
marginal maximum likelihood is relatively simple. Let $P$ be the array with coordinates $P_s$
for $s$ in $S$, and let

$$\ell_S(P) = \sum_{s \in S} N_S(s) \log P_s$$

be the marginal log likelihood for the examinee totals $S_i$, $1 \le i \le n$, under the unrestricted
model that $S_i = s$ with probability $P_s$ for some nonnegative $P_s$ such that $\sum_{s \in S} P_s = 1$.
Then

$$\ell(p_J) = \ell_C(p_{JC}) + \ell_S(P).$$
Let $\ell_M$ denote the maximum of the log likelihood $\ell(p_J)$ under the condition that $p_J$ is in
the set $S$ corresponding to the nominal model, and let $\ell_{M+}$ denote the maximum of $\ell(p_J)$
under the assumption that $p_J$ is in the set $S^+$ that corresponds to the log-linear extension
of the nominal model. Obviously $\ell_M \le \ell_{M+}$. Let $\ell_{CM}$ be the maximum of $\ell_C(p_{JC})$ under
the constraint that (20) holds for some $\beta$ in $\Lambda$. Let $\ell_{SM}$ be the maximum

$$\sum_{s \in S} N_S(s) \log[N_S(s)/n]$$

of $\ell_S(P)$ ($0 \log 0$ is taken to be 0). Then

$$\ell_M \le \ell_{M+} = \ell_{CM} + \ell_{SM}.$$

Thus conditional maximum likelihood corresponds to ordinary maximum likelihood for the
log-linear extension of the nominal model.

The conditional maximum-likelihood estimate $\beta^*$ of $\beta$, if it exists, is the element of $\Lambda$
such that

$$\hat{p}_{JC}(c) = \exp(-\beta^{*\prime} c)/M_s(\beta^*),$$

and

$$\ell_C(\hat{p}_{JC}) = \ell_{CM}.$$

If $\beta^*$ exists, then it satisfies the conditional maximum-likelihood equations

$$\hat{m}_{sjkC} = m_{sjk}(\beta^*)$$

and

$$\sum_{s \in S} N_S(s) \hat{m}_{sjkC} = Z_{+jk}$$

for $1 \le j \le q$ and $0 \le k \le r_j - 1$. Conversely, if $\beta^*$ is a vector in $\Lambda$ and

$$\sum_{s \in S} N_S(s) m_{sjk}(\beta^*) = Z_{+jk},$$

then $\beta^*$ is a conditional maximum-likelihood estimate of $\beta$. Provided that $S'$ is nonempty,
no more than one conditional maximum-likelihood estimate $\beta^*$ exists.

Existence of conditional maximum-likelihood estimates is an issue, although normally
a much less important one than in the case of joint estimation. Consider the following
theorem (Haberman, 1974, chaps. 2, 7).
Theorem 6 In the case of $S'$ nonempty, the estimate $\beta^*$ exists if, and only if, $g_{sjk} \ge 0$ can
be found for $s$ in $S^+$, $0 \le k \le r_j - 1$, and $1 \le j \le q$, such that $g_{sjk} > 0$ for $s$ in $S^+$ if $c$
in $\Gamma(s)$ exists such that $c_j = k$, $g_{sjk} = 0$ otherwise, $g_{+jk} = f_{+jk}$ for $0 \le k \le r_j - 1$ and
$1 \le j \le q$, and $\sum_{j=1}^{q} \sum_{k=0}^{r_j-1} g_{sjk} = qN_S(s)$ for $s$ in $S^+$.

It follows that conditional maximum-likelihood estimates exist whenever joint
maximum-likelihood estimates exist.

Extended conditional maximum-likelihood estimates may be considered if $\beta^*$ does not
exist. There are $\hat{m}_{sjkC}$ in $[0,1]$ such that

$$\sum_{s \in S} N_S(s) \hat{m}_{sjkC} = Z_{+jk},$$

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j-1} u_j(k) \hat{m}_{sjkC} = s,$$

and $m_{sjk}(\beta)$ approaches $\hat{m}_{sjkC}$ for $N_S(s) > 0$ if (20) holds and $\ell_C(p_{JC})$ approaches
$\ell_{CM}$. If the conditional maximum-likelihood estimate exists, then $\hat{m}_{sjkC} = m_{sjk}(\beta^*)$.
Various conventions can be considered to define $\beta^*$ in the case in which no conditional
maximum-likelihood estimate exists for $\beta$.

Given the estimate $\beta^*$, it is possible to estimate the examinee abilities $\theta_i$. For each $i$,
the log likelihood for $\theta_i$ given the $\beta^*_{jk}$ is

$$\sum_{j=1}^{q} \sum_{k=0}^{r_j-1} Z_{ijk} \log p_{jk}(\beta^*, \theta_i).$$

Given the definition of $g_s$ in the proof of Theorem 5, it follows that the estimate $\theta^*_i$ of $\theta_i$ is

$$\theta^*_{sC} = g_s(\beta^*)$$

for $s(1) < s = S_i < s(A)$. For $S_i = s = s(1)$, $\theta^*_i = \theta^*_{sC} = -\infty$. For $S_i = s = s(A)$,
$\theta^*_i = \theta^*_{sC} = \infty$.
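
Because the ability estimate depends on the responses only through the total score, it can be found by solving a single monotone equation: the expected total score at the estimate must equal the observed total. The following sketch uses bisection and is only an illustration under stated assumptions. Category probabilities are assumed proportional to exp(-beta_jk + u_j(k) * theta), each item is assumed to have at least two distinct category scores, and all function names are hypothetical.

```python
import math

def category_probs(beta_j, u_j, theta):
    """Category probabilities for one item, assuming p_jk is proportional to
    exp(-beta_jk + u_j(k) * theta)."""
    logits = [-b + u * theta for b, u in zip(beta_j, u_j)]
    top = max(logits)                      # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_total(theta, betas, scores):
    """Expected total score at ability theta (increasing in theta)."""
    total = 0.0
    for beta_j, u_j in zip(betas, scores):
        p = category_probs(beta_j, u_j, theta)
        total += sum(u * pk for u, pk in zip(u_j, p))
    return total

def ability_from_total(s, betas, scores, tol=1e-10):
    """Solve expected_total(theta) = s; extreme totals give infinite estimates."""
    lowest = sum(min(u_j) for u_j in scores)
    highest = sum(max(u_j) for u_j in scores)
    if s <= lowest:
        return -math.inf
    if s >= highest:
        return math.inf
    lo, hi = -1.0, 1.0
    while expected_total(lo, betas, scores) > s:   # widen the bracket downward
        lo *= 2.0
    while expected_total(hi, betas, scores) < s:   # widen the bracket upward
        hi *= 2.0
    while hi - lo > tol:                           # bisection on the monotone function
        mid = 0.5 * (lo + hi)
        if expected_total(mid, betas, scores) < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Usage: two binary items with equal parameters.
betas = [[0.0, 0.0], [0.0, 0.0]]
scores = [[0, 1], [0, 1]]
theta_mid = ability_from_total(1, betas, scores)
```

With two symmetric binary items and a total of 1, the solution is an ability of 0, while totals of 0 and 2 give infinite estimates, matching the boundary cases described above.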

3.1 Large-Sample Properties 

For $q$ fixed, if the nominal model is valid and $n$ becomes large, then there is no
difficulty in demonstrating that $\beta^*$ is a consistent and asymptotically normal estimate for
$\beta$ (Haberman, 1977). In the case of $q$ increasing, a bit more argument is required. Consider
integers $j$, $j'$, $k$, and $k'$ such that $0 \le k \le r_j - 1$, $0 \le k' \le r_{j'} - 1$, $1 \le j \le q$, and $1 \le j' \le q$.
Let

$$V_{jkj'k'C} = V_{jkj'k'}(\beta) = \sum_{s \in S} N_S(s) v_{sjkj'k'}(\beta)$$

be the conditional covariance of $Z_{+jk}$ and $Z_{+j'k'}$ given the $N_S(s)$, $s$ in $S$. Let $V_C = V(\beta)$
be the $R_q$ by $R_q$ matrix with row $\zeta(j,k)$ and column $\zeta(j',k')$ equal to $V_{jkj'k'}(\beta)$. This
matrix is of rank $R_q - q - 1$ if $S'$ is nonempty. Let $\bar{V}_C$ be the expected value of $n^{-1} V_C$,
so that $\bar{V}_C$ is obtained from $V_C$ by substitution of $P_s$ for $N_S(s)$. Note that if $Z^*_{ijk}$ is the
random variable equal to $Z_{ijk} - m_{sjkC}$ for $S_i = s$ and if $Z^*_i$ is the $R_q$-dimensional vector
with coordinate $\zeta(j,k)$ equal to $Z^*_{ijk}$ for $1 \le j \le q$ and $0 \le k \le r_j - 1$, then $\bar{V}_C$ is the
covariance matrix of $Z^*_i$ for each observation $i$. Let $\bar{V}^+_C$ be the $R_q$ by $R_q$ matrix with row
$\zeta(j,k)$ and column $\zeta(j',k')$ equal to

$$\bar{V}^+_{jkj'k'C} = \bar{V}_{jkj'k'C} + \delta_j(j') + [u_j(k) - \bar{u}_j][u_{j'}(k') - \bar{u}_{j'}].$$

Arguments rather similar to those applied in the case of joint maximum-likelihood
estimation may also be applied to conditional maximum-likelihood estimation. If the
number $q$ of items is fixed, then $\beta^*$ converges almost surely to $\beta$ and $n^{1/2}(\beta^* - \beta)$ converges
in distribution to a multivariate normal random variable with mean 0 and covariance
matrix $(\bar{V}^+_C)^{-1} \bar{V}_C (\bar{V}^+_C)^{-1}$. If $q$ approaches $\infty$, then $|\beta^* - \beta|$ converges in probability to 0.
For an integer $Q \ge 1$, let $\beta^*_{jk}$ be coordinate $\zeta(j,k)$ of $\beta^*$ and let $\beta^*_Q$ be the $R_Q$-dimensional
vector with coordinate $\zeta(j,k)$ equal to $\beta^*_{jk}$ for $1 \le j \le Q$ and $0 \le k \le r_j - 1$. Then
$n^{1/2}(\beta^*_Q - \beta_Q)$ converges in distribution to a multivariate normal random vector with mean
0 and the covariance matrix $K_Q (T^+_Q)^{-1} T_Q (T^+_Q)^{-1} K'_Q$ encountered in the discussion of the
normal approximation for $n^{1/2}(\hat{\beta}_Q - \beta_{QM})$. As in the binary Rasch model, conditional
estimation has the major advantage that the asymptotic approximations involve the actual
parameters of interest, namely the $\beta_{jk}$, rather than the $\beta_{jkM}$ parameters. It should be noted
that $K_Q (T^+_Q)^{-1} T_Q (T^+_Q)^{-1} K'_Q$ is the limit of the matrix formed from the first $R_Q$ rows and
columns of $(\bar{V}^+_C)^{-1} \bar{V}_C (\bar{V}^+_C)^{-1}$. Let $\hat{V}_C$ be $V(\beta^*)$, and let $\hat{V}^+_C$ be the $R_q$ by $R_q$ matrix with
row $\zeta(j,k)$ and column $\zeta(j',k')$ equal to

$$\hat{V}^+_{jkj'k'C} = V_{jkj'k'}(\beta^*) + n\delta_j(j') + n[u_j(k) - \bar{u}_j][u_{j'}(k') - \bar{u}_{j'}]$$

for integers $j$, $j'$, $k$, and $k'$ such that $0 \le k \le r_j - 1$, $0 \le k' \le r_{j'} - 1$, $1 \le j \le q$,
and $1 \le j' \le q$. Then both for $q$ fixed and $q$ increasing, asymptotic confidence intervals
for parameters such as $\beta_{jk}$ are easily constructed by estimation of the asymptotic
standard deviation $\sigma(\beta^*_{jk})$ of $\beta^*_{jk}$ by the square root of row $\zeta(j,k)$ and column $\zeta(j,k)$ of
$(\hat{V}^+_C)^{-1} \hat{V}_C (\hat{V}^+_C)^{-1}$. Thus results are quite similar to those for the binary Rasch model
(Haberman, 2004).

If $q$ approaches $\infty$, then the asymptotic properties of $\theta^*_i$ are essentially the same as those
for $\hat{\theta}_i$ as far as consistency, asymptotic normality, and approximate confidence intervals are
concerned. Estimation of the distribution of $\theta_1$ can be implemented in essentially the same
fashion as in JMLE by substitution of $\theta^*_i$ for $\hat{\theta}_i$.

Estimation of the entropy measures $H_C$ and $H_U$ involves relatively little difficulty, for
$H_C$ may be estimated by

$$\hat{H}_{CN} = -\frac{1}{nq} \ell_{CM},$$

$H_+$ may be estimated by

$$\hat{H}_+ = -\frac{1}{nq} \sum_{s \in S} N_S(s) \log[N_S(s)/n],$$

and $H_U$ may be estimated by

$$\hat{H}_{UN} = \hat{H}_{CN} + \hat{H}_+.$$

For $q$ constant, $\hat{H}_{CN}$ converges almost surely to $H_C$, $\hat{H}_+$ converges almost surely to $H_+$,
and $\hat{H}_{UN}$ converges almost surely to $H_U$. For $q$ increasing, $\hat{H}_{CN} - H_C$, $\hat{H}_+ - H_+$, and
$\hat{H}_{UN} - H_U$ all converge in probability to 0. Normal approximations are readily available,
at least if $q/n$ approaches 0. Let $\sigma(H_U)$ be the standard deviation of $q^{-1} \log p_J(Y_1)$, and
let $\sigma(H_C)$ be the standard deviation of $q^{-1} \log p_{JC}(Y_1)$. Let $\sigma(H_C)$ and $\sigma(H_U)$ be positive,
and assume that neither approaches 0 if $q$ approaches $\infty$. Then $n^{1/2}(\hat{H}_{UN} - H_U)/\sigma(H_U)$
and $n^{1/2}(\hat{H}_{CN} - H_C)/\sigma(H_C)$ both converge in distribution to a standard normal random
variable. These results are readily applied to construction of approximate confidence
intervals for $H_C$ and $H_U$ (Gilula & Haberman, 1995).
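
The entropy estimates described above need only the observed total-score counts and the maximized conditional log likelihood. The sketch below illustrates the two formulas; it assumes the maximized conditional log likelihood has already been obtained from a separate fitting step and is passed in as `l_cm`. Function and argument names are hypothetical.

```python
import math
from collections import Counter

def score_entropy_per_item(totals, q):
    """Estimate of the entropy per item of the total score:
    -(nq)^-1 * sum over observed scores of N(s) * log(N(s)/n)."""
    n = len(totals)
    counts = Counter(totals)               # N(s) for each observed total s
    return -sum(c * math.log(c / n) for c in counts.values()) / (n * q)

def unconditional_entropy_per_item(l_cm, totals, q):
    """Sum of the conditional entropy estimate -(nq)^-1 * l_cm and the
    score entropy estimate."""
    n = len(totals)
    h_cn = -l_cm / (n * q)
    return h_cn + score_entropy_per_item(totals, q)

# Usage: four examinees on a two-item test with totals 0, 1, 1, 2.
totals = [0, 1, 1, 2]
h_plus = score_entropy_per_item(totals, q=2)
```

Scores that are never observed contribute nothing to the sum, which is consistent with the convention that 0 log 0 is taken to be 0.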
3.2 The Newton-Raphson Algorithm

The Newton-Raphson algorithm for conditional estimation for the nominal model is
rather similar to the Newton-Raphson algorithm for conditional estimation for the binary
Rasch model (Andersen, 1972, 1983; Haberman, 2004). One begins with a preliminary
approximation $\beta_0$ to $\beta^*$. One possibility is $\hat{\beta}$. One then uses the iterations

$$\beta_{t+1} = \beta_t - (V^+_t)^{-1}[Z_+ - m_{+t}],$$

where $m_{+t}$ is the $R_q$-dimensional vector with coordinate $\zeta(j,k)$ equal to

$$m_{+jk}(\beta_t) = \sum_{s \in S} N_S(s) m_{sjk}(\beta_t)$$

for $0 \le k \le r_j - 1$ and $1 \le j \le q$ and $V^+_t$ is the $R_q$ by $R_q$ matrix with row $\zeta(j,k)$ and
column $\zeta(j',k')$ equal to

$$V^+_{jkj'k't} = V_{jkj'k'}(\beta_t) + n\delta_j(j') + n[u_j(k) - \bar{u}_j][u_{j'}(k') - \bar{u}_{j'}]$$

for integers $j$, $j'$, $k$, and $k'$ such that $0 \le k \le r_j - 1$, $0 \le k' \le r_{j'} - 1$, $1 \le j \le q$, and
$1 \le j' \le q$. In typical cases, $\beta_t$ converges quite rapidly to $\beta^*$.

Even more than for the Newton-Raphson algorithm for the binary Rasch model,
implementation of the Newton-Raphson algorithm is challenging for a large number $q$ of
items. For efficient computation, consider random variables $\omega_{jt}$ and $L_t$ defined so that the
$\omega_{jt}$ are independent for $1 \le j \le q$, $L_t = \sum_{j=1}^{q} u_j(\omega_{jt})$, $\omega_{jt}$ assumes integer values from 0 to
$r_j - 1$, and $\omega_{jt} = k$ with probability

$$p_{jkt} = p_{jk}(\beta_t, 0).$$

As in the proof of Theorem 5,

$$m_{sjkt} = m_{sjk}(\beta_t) = p_{jkt} P(L_t - u_j(\omega_{jt}) = s - u_j(k))/P(L_t = s).$$

Similarly, $v_{sjkj'k't} = v_{sjkj'k'}(\beta_t)$ satisfies

$$v_{sjkjkt} = m_{sjkt}(1 - m_{sjkt}),$$

$$v_{sjkjk't} = -m_{sjkt} m_{sjk't}, \quad k \ne k',$$

and

$$v_{sjkj'k't} = \frac{p_{jkt} p_{j'k't} P(L_t - u_j(\omega_{jt}) - u_{j'}(\omega_{j't}) = s - u_j(k) - u_{j'}(k'))}{P(L_t = s)} - m_{sjkt} m_{sj'k't}, \quad j \ne j'.$$

At this point, probabilities such as $P(L_t = s)$ may be computed by use of a recursion
formula. Let $S_i$ be the set of possible sums $\sum_{j=1}^{i} u_j(k_j)$ for $0 \le k_j \le r_j - 1$ for $1 \le j \le i$. Let
$S_0$ be the set with element 0. Let $a_t(h, i)$ be the probability that $\sum_{j=1}^{i} u_j(\omega_{jt}) = h$ for $h$ in $S_i$
and $1 \le i \le q$, and let $a_t(0, 0) = 1$. For $h$ in $S_i$ and $1 \le i \le q$, let $K(h, i)$ be the set of
integers $k$ from 0 to $r_i - 1$ such that $h - u_i(k)$ is in $S_{i-1}$. Then

$$a_t(h, i) = \sum_{k \in K(h,i)} p_{ikt}\, a_t(h - u_i(k), i - 1),$$

and $P(L_t = s)$ is $a_t(s, q)$.
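
The recursion can be sketched compactly by building the distribution of the partial sums one item at a time. The code below is an illustration under stated assumptions, not the report's implementation: it propagates probabilities forward from each attainable partial sum, which yields the same values as the recursion stated above, and it assumes item scores are integers (or other exactly comparable values) so they can serve as dictionary keys. The leave-one-item-out step recomputes the distribution without item j for clarity rather than efficiency. Function names are hypothetical.

```python
def total_score_distribution(probs, scores):
    """Distribution of the sum of independent item scores.
    probs[j][k] is the probability of category k on item j and
    scores[j][k] is the corresponding item score."""
    dist = {0: 1.0}                        # partial sum over zero items
    for p_j, u_j in zip(probs, scores):
        new = {}
        for h, a in dist.items():          # each attainable partial sum
            for pk, u in zip(p_j, u_j):    # add one more item
                new[h + u] = new.get(h + u, 0.0) + a * pk
        dist = new
    return dist

def conditional_category_probs(probs, scores, j, s):
    """Conditional probability of each category of item j given total s:
    p_jk * P(sum without item j = s - u_j(k)) / P(sum = s)."""
    full = total_score_distribution(probs, scores)
    rest = total_score_distribution(probs[:j] + probs[j + 1:],
                                    scores[:j] + scores[j + 1:])
    return [pk * rest.get(s - u, 0.0) / full[s]
            for pk, u in zip(probs[j], scores[j])]

# Usage: three binary items, each category with probability 0.5.
probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
scores = [[0, 1], [0, 1], [0, 1]]
dist = total_score_distribution(probs, scores)
m = conditional_category_probs(probs, scores, j=0, s=1)
```

In the toy case, exactly one of three exchangeable items is answered with score 1 when the total is 1, so the conditional probability that the first item is that one is one third.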

Given that this recursion procedure is employed with double precision arithmetic, no
major computational problems are encountered. The initial values from joint estimation
are quite effective as starting values, for no difference between $\hat{\beta}_{jk}$ and $\beta^*_{jk}$ exceeds 0.06
in magnitude for the data from Form A of the TOEFL field trial, and most differences are
much smaller in magnitude. The estimated asymptotic standard deviations range roughly
from 0.03 to 0.10, so that the differences between $\hat{\beta}_{jk}$ and $\beta^*_{jk}$ can be large enough to raise
some questions about the quality of large-sample approximations for JMLE.

4. Conclusions 

The results derived in the preceding sections suggest that CMLE provides an effective 
approach for analysis of the nominal model even in cases in which both the sample size and 
the number of items are large. Standard large-sample approximations for the distributions 
of conditional maximum-likelihood estimates have been shown to apply, so that asymptotic 
confidence intervals are available. 

Efforts have also been made to apply JMLE under realistic conditions. Results have 
been somewhat less satisfactory for the TOEFL example. 

This report does not treat all important issues for the nominal model. Goodness of fit is
an issue, and behavior of estimates when the model fails is important. The measurement of
the size of model error should also be explored. Generalizations of the model that are similar
to 2PL models are of interest, and use of restricted ability distributions can be explored as
in conventional applications of marginal maximum likelihood.